One, what is traffic limiting?

Traffic limiting is one of the three most effective ways to protect a high-concurrency system, alongside caching and service degradation.
By controlling the amount and rate of traffic sent into the system, you can adjust the data transfer rate and reduce congestion, protecting both yourself and downstream systems from bursts of concurrent traffic.

Second, traffic limiting strategies

The purpose of traffic limiting is to protect the system by restricting the rate of concurrent access and the number of concurrent requests. The associated strategy is usually to trigger a limiting behavior once some limit is reached (a rate, a failure rate, a queuing time, and so on). Generally speaking, the triggered limiting behaviors are as follows:

Denial of service

Reject the excess requests. Generally, a good traffic limiting system will track which clients are sending the most requests and, when a traffic surge arrives, reject those clients directly. This blocks out some abnormal or malicious high-concurrency access.

Service degradation

Shut down or degrade back-end services so that the remaining services have enough resources to handle more requests. Downgrading can be done in many ways. One way is to stop services that are not important and transfer their CPU, memory, or data resources to more important functions. Another is to return only partial data instead of full data: full data may require SQL join operations while partial data does not, so SQL executes faster. The fastest option is to return a default cached value directly, sacrificing consistency for higher performance.

Privilege request

A privilege request means that when resources run out, the limited resources that remain are allocated to important users first, for example VIP users with higher priority. In multi-tenant systems, large customers should be protected during traffic limiting, so large customers are served preferentially and other, non-privileged users have to wait.

Delay processing

In this case, there is usually a queue to buffer a large number of requests. If the queue is full, users can only be rejected; if a task in the queue times out, a "system busy" error is returned. Buffer queues are only for relieving pressure and are generally suited to short spikes.

Elastic scaling

Use automated operations to scale out the corresponding services. This requires an application performance monitoring system that knows which are the current top 5 busiest services, plus an operations system that automates distribution, deployment, and service registration, so scaling can happen as quickly as possible; otherwise the system will be crushed before scaling completes. Of course, if the database itself is under too much pressure, elastically scaling the application layer does not help; in that case traffic should be limited instead.

Three, the implementation of traffic limiting

3.1 Rate limiter

3.1.1 Static rate limiter

3.1.1.1 Leaky Bucket

For a schematic diagram and more details of the Leaky Bucket algorithm, see the wiki entry [Leaky Bucket](https://en.wikipedia.org/wiki/Leaky_bucket).

Suppose we have a bucket into which a bursty flow of water is poured at a varying rate, but water must be drawn out at a fixed rate. To do this, we cut a hole in the bottom of the bucket, which ensures that the water flows out at a fixed rate; and if the bucket is full, we stop pouring water in (denial of service or degradation).
The input rate can vary, but the output rate remains constant. In networking, the leaky bucket technique smooths out burst traffic and protects the back end from an avalanche when traffic overloads occur.
Generally, a FIFO queue can be used to implement this process, as sketched below:
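Below is a minimal Go sketch of a leaky-bucket limiter built on a bounded FIFO queue that drains at a fixed rate; the names (`LeakyBucket`, `Submit`, `leakInterval`) and numbers are illustrative, not taken from any of the libraries mentioned later.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// LeakyBucket queues incoming requests in a bounded FIFO channel (the bucket)
// and drains them at a fixed rate (the hole in the bottom).
type LeakyBucket struct {
	queue chan func() // buffered channel acts as the FIFO bucket
}

func NewLeakyBucket(capacity int, leakInterval time.Duration) *LeakyBucket {
	lb := &LeakyBucket{queue: make(chan func(), capacity)}
	go func() {
		ticker := time.NewTicker(leakInterval)
		defer ticker.Stop()
		for range ticker.C {
			select {
			case job := <-lb.queue: // leak one request per tick
				job()
			default: // bucket empty, nothing to do
			}
		}
	}()
	return lb
}

// Submit returns an error when the bucket is full (denial of service / degradation).
func (lb *LeakyBucket) Submit(job func()) error {
	select {
	case lb.queue <- job:
		return nil
	default:
		return errors.New("bucket full: request rejected")
	}
}

func main() {
	lb := NewLeakyBucket(10, 100*time.Millisecond) // capacity 10, drains ~10 req/s
	for i := 0; i < 15; i++ {
		n := i
		if err := lb.Submit(func() { fmt.Println("processed", n) }); err != nil {
			fmt.Println("rejected", n, err)
		}
	}
	time.Sleep(2 * time.Second) // wait for the queue to drain
}
```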

3.1.1.2 Token Bucket

For a diagram and more details of the Token Bucket algorithm, see the wiki entry [Token Bucket](https://en.wikipedia.org/wiki/Token_bucket).

Tokens are put into a bucket at a fixed rate. When a handler wants to process a request, it must first obtain a token; if no token is available, the request is not processed (denial of service or degradation).
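As a rough illustration of the idea (not any particular library's implementation), a token bucket can be sketched in Go with a buffered channel of tokens and a refill goroutine; `TokenBucket`, `rate`, and `burst` are assumed names.

```go
package main

import (
	"fmt"
	"time"
)

// TokenBucket refills tokens at a fixed rate, up to a maximum burst size.
type TokenBucket struct {
	tokens chan struct{} // buffered channel holds the available tokens
}

func NewTokenBucket(rate int, burst int) *TokenBucket {
	tb := &TokenBucket{tokens: make(chan struct{}, burst)}
	for i := 0; i < burst; i++ { // start with a full bucket
		tb.tokens <- struct{}{}
	}
	go func() {
		ticker := time.NewTicker(time.Second / time.Duration(rate))
		defer ticker.Stop()
		for range ticker.C {
			select {
			case tb.tokens <- struct{}{}: // add a token
			default: // bucket already full, drop the token
			}
		}
	}()
	return tb
}

// Allow consumes a token if one is available; otherwise the request is rejected.
func (tb *TokenBucket) Allow() bool {
	select {
	case <-tb.tokens:
		return true
	default:
		return false
	}
}

func main() {
	tb := NewTokenBucket(100, 20) // 100 tokens/s, burst of 20
	for i := 0; i < 25; i++ {
		fmt.Println(i, "allowed:", tb.Allow())
	}
}
```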

3.1.1.3 Comparison of two classical algorithms

| Leaky Bucket | Token Bucket |
| --- | --- |
| When the host has to send a packet, the packet is thrown into the bucket. | The bucket holds tokens generated at regular intervals of time. |
| The bucket leaks at a constant rate. | The bucket has a maximum capacity. |
| Bursty traffic is converted into uniform traffic by the leaky bucket. | If there is a ready packet, a token is removed from the bucket and the packet is sent. |
| In practice the bucket is a finite queue that outputs at a finite rate. | If there is no token in the bucket, the packet cannot be sent. |
The Token bucket allows burst traffic to some extent, while the Leaky bucket is mainly used to ensure smooth traffic.
For more information, see [Detailed Explanation of Guava RateLimiter's Throttling Mechanism](https://www.alibabacloud.com/blog/detailed-explanation-of-guava-ratelimiters-throttling-mechanism_594820).

3.1.1.4 Open source limiters

[Google Guava RateLimiter](https://guava.dev/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html): a Java rate limiter based on the Token Bucket algorithm
[uber-go/ratelimit](https://github.com/uber-go/ratelimit): a Golang rate limiter based on the Leaky Bucket algorithm
[golang.org/x/time/rate](http://golang.org/x/time/rate) [Limiter](https://godoc.org/golang.org/x/time/rate#Limiter): the official Golang rate limiter based on the Token Bucket algorithm (usage example below)
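As an example, basic usage of the official `golang.org/x/time/rate` limiter might look like the following; the rate and burst values here are arbitrary.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Allow 100 events per second with a burst capacity of 200 tokens.
	limiter := rate.NewLimiter(rate.Limit(100), 200)

	// Non-blocking check: reject immediately when no token is available.
	if limiter.Allow() {
		fmt.Println("request allowed")
	} else {
		fmt.Println("request rejected")
	}

	// Blocking variant: wait (up to the context deadline) for a token.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	if err := limiter.Wait(ctx); err != nil {
		fmt.Println("timed out waiting for a token:", err)
	}
}
```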

3.1.2 Dynamic rate limiter

Background

Static rate limiting algorithms are easy to use, low-cost, and clearly effective.
"Static" means that the parameters of the Leaky Bucket and Token Bucket (rates and bucket capacities) are hard-coded and cannot be detected or adjusted dynamically.
So how should we set these parameters when we use them?
This is a real headache. Generally, we run performance tests on a specified hardware configuration (CPU, memory, network bandwidth, etc.) to find the maximum TPS/QPS and use it as the benchmark for the parameter settings. Even so, in a distributed system the environment is complex and changeable: Redis or the DB may have a problem, cluster instances may be scaled out or in, and so on. The QPS we measured is then no longer reliable, and the traffic limiting strategy will run into serious problems.
See the related GitHub issue [Possible to create a dynamic rate limiter?](https://github.com/resilience4j/resilience4j/issues/174).

How to solve these problems?

1) Manually modify dynamic configuration: semi-automatic. The parameters live in a dynamic configuration center and are changed manually, based on human judgment, when a specific situation arises.
2) Dynamic rate limiter: the ultimate, fully automated solution (but complicated to implement).

Basic introduction

Core idea: dynamically sense the pressure on the system and limit traffic automatically.
Based on dynamic runtime indicators, such as the system's own CPU and memory usage and the real-time timeout rate and error rate of calls to the back end, it can dynamically estimate the QPS that it or the back end can bear. When external conditions change, it adjusts itself and keeps operating well without manual intervention.

Algorithm implementation

A. Initial state

Define the flow control window `window`, with an initial value `confidence` (which can be obtained from load testing or set to a reasonable starting value):

window(t) = confidence, where t = 0

Define a window upper-limit threshold (it should not be too far from the QPS obtained from load testing), for example 2 * confidence:

threshold(t) = 2 * confidence, where t = 0

B. Window growth:

Fast growth (on success, before the window reaches the threshold):

window(t) = window(t-1) + α * success_weight(t)

Slow growth (on success, after the window reaches the threshold):

window(t) = window(t-1) + (α * success_weight(t)) / window(t-1)

A code sketch of this window-adjustment logic is given below, after step C.

C. Window drop (reduction):

w_drop = array(confidence * 2, .....)
w_drop = append(w_drop, rps_drop)
threshold(t) = sum(w_drop) / len(w_drop)

Refer to Designing Simple & Dynamic Rate Limiter In Golang for more details.
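The following is a minimal, illustrative Go sketch of the window-based adjustment described in steps A-C, assuming the caller reports success or congestion after each request; the names (`DynamicLimiter`, `alpha`) and the halving of the window on congestion are assumptions, not code from the referenced article.

```go
package main

import "fmt"

// DynamicLimiter adjusts its flow control window based on feedback.
type DynamicLimiter struct {
	window    float64   // current allowed QPS (flow control window)
	threshold float64   // upper limit for fast growth
	alpha     float64   // growth coefficient
	drops     []float64 // QPS values observed when congestion occurred
}

func NewDynamicLimiter(confidence float64) *DynamicLimiter {
	return &DynamicLimiter{
		window:    confidence,
		threshold: 2 * confidence,
		alpha:     1.0,
		drops:     []float64{confidence * 2},
	}
}

// OnSuccess grows the window: fast (additive) below the threshold,
// slow (divided by the current window) once the threshold is reached.
func (d *DynamicLimiter) OnSuccess(successWeight float64) {
	if d.window < d.threshold {
		d.window += d.alpha * successWeight
	} else {
		d.window += (d.alpha * successWeight) / d.window
	}
}

// OnCongestion records the QPS at which congestion occurred, recomputes the
// threshold as the average of recorded drops, and shrinks the window.
func (d *DynamicLimiter) OnCongestion(rpsAtDrop float64) {
	d.drops = append(d.drops, rpsAtDrop)
	sum := 0.0
	for _, v := range d.drops {
		sum += v
	}
	d.threshold = sum / float64(len(d.drops))
	d.window = d.window / 2 // multiplicative decrease, as in TCP congestion control
}

func main() {
	d := NewDynamicLimiter(100)
	for i := 0; i < 50; i++ {
		d.OnSuccess(1)
	}
	d.OnCongestion(140)
	fmt.Printf("window=%.1f threshold=%.1f\n", d.window, d.threshold)
}
```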

The above algorithm leaves one key problem open: how do we define congestion and congestion recovery?
You can collect statistics on the timeout rate over a time window (for example, 3s). If the timeout rate exceeds 0.01%, the system is considered congested; otherwise it is considered recovered (normal). Is there a better way? Yes: judge ahead of time by how close the latency gets to the timeout value, rather than waiting for actual timeouts, which improves sensitivity.
These statistics are usually weighed against CPU consumption.

Summary: the dynamic traffic limiting algorithm based on a flow control window borrows from the congestion control algorithm of the TCP protocol. The details of window growth and window reduction, as well as the triggering strategy, can be defined as needed: simple versions just increase or decrease by a percentage of the baseline, while more complex ones define stages such as the fast growth, slow growth, and multiplicative decrease described above.

3.2 Circuit breaker

3.2.1 Circuit breaker introduction

To protect the service when something goes wrong downstream, for example when the error rate of calls to a downstream service soars, the circuit breaker trips: requests to the downstream are randomly discarded or even stopped entirely, so that the service is not dragged down by the downstream into a full avalanche from which it cannot recover.
It is inspired by the fuse in our electrical switchboxes: when there is an electrical fault (such as a short circuit), it automatically trips, the circuit is disconnected, and our appliances are protected. Otherwise appliances could burn out, or even start a fire if no one is home or everyone is asleep. So in the electrical world there are often self-protection devices like this.

3.2.2 State machine circuit breaker

A circuit breaker can be thought of as a proxy for operations that might fail. The proxy monitors the number of recent failures and uses this information to decide whether to allow the operation to proceed or to simply return an exception. The proxy can be implemented as a state machine with the following states, which simulate the behavior of an electrical circuit breaker:

Closed state:

We keep a counter of failed calls, incrementing it by one whenever a call fails. If the number of recent failures exceeds the allowed threshold within a given period of time, the breaker switches to the Open state and a timeout clock is started; when the clock expires, the state switches to Half-Open.

Open state:

In this state, requests from the application immediately return an error response without the back-end service being invoked at all.

Half-open state:

A limited number of requests from the application are allowed through to invoke the service. If these requests succeed, it is assumed that the fault that previously caused failures has been fixed: the circuit breaker switches to the Closed state and the failure counter is reset. If these requests fail, the fault is considered to still exist, so the circuit breaker switches back to the Open state and the timeout clock is restarted to give the system more time to recover. The Half-Open state is a great way to prevent a recovering service from being overwhelmed again by a sudden flood of requests.

For more details, see the [Circuit Breaker pattern](https://docs.microsoft.com/en-us/previous-versions/msp-n-p/dn589784(v=pandp.10)).
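A highly simplified Go sketch of the three-state machine described above, under the assumption of a single consecutive-failure counter and a fixed open timeout; it is illustrative, not a production implementation.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// Breaker is a minimal three-state circuit breaker.
type Breaker struct {
	mu          sync.Mutex
	state       State
	failures    int           // failures accumulated while Closed
	maxFailures int           // threshold that trips the breaker
	openUntil   time.Time     // when the Open state times out
	openTimeout time.Duration // how long to stay Open before probing
}

var ErrOpen = errors.New("circuit breaker is open")

func (b *Breaker) Call(op func() error) error {
	b.mu.Lock()
	if b.state == Open {
		if time.Now().Before(b.openUntil) {
			b.mu.Unlock()
			return ErrOpen // fail fast without calling the back end
		}
		b.state = HalfOpen // timeout expired: allow a probe request
	}
	b.mu.Unlock()

	err := op()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.state == HalfOpen || b.failures >= b.maxFailures {
			b.state = Open // trip (or re-trip) and restart the timeout clock
			b.openUntil = time.Now().Add(b.openTimeout)
			b.failures = 0
		}
		return err
	}
	// Success: close the breaker and reset the failure counter.
	b.state = Closed
	b.failures = 0
	return nil
}

func main() {
	b := &Breaker{maxFailures: 5, openTimeout: 10 * time.Second}
	_ = b.Call(func() error { return nil }) // wrap any downstream call
}
```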

3.2.3 Open source implementation

gobreaker: [https://github.com/sony/gobreaker](https://github.com/sony/gobreaker) (usage sketch below)
go-circuitbreaker: [https://github.com/mercari/go-circuitbreaker](https://github.com/mercari/go-circuitbreaker)
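As an example, wrapping a downstream HTTP call with sony/gobreaker might look roughly like this; the name, timeout, and trip condition are illustrative settings.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

func main() {
	cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:    "example-backend",
		Timeout: 10 * time.Second, // how long to stay Open before Half-Open probing
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			// Trip after 5 consecutive failures (illustrative threshold).
			return counts.ConsecutiveFailures >= 5
		},
	})

	body, err := cb.Execute(func() (interface{}, error) {
		resp, err := http.Get("https://example.com")
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		return ioutil.ReadAll(resp.Body)
	})
	if err != nil {
		fmt.Println("request failed or breaker open:", err)
		return
	}
	fmt.Println("got", len(body.([]byte)), "bytes")
}
```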

3.3 Other overload judgment methods

Rate limiters mainly use QPS to decide whether the system is overloaded, and circuit breakers mainly use the failure count or failure rate of downstream calls; overload judgments based on other signals are collectively referred to here as "other overload judgment methods".
Overload control is the behavior of a server that, on discovering its processing capacity is insufficient, discards a certain number of requests randomly or according to certain rules, in order to prevent deterioration or even an avalanche. The judgment here is based mainly on the server's own processing capacity, using indicators such as:
queue wait time, throughput, request processing latency, memory utilization, CPU utilization, and so on.
Queue wait time = the time the request starts being processed - the time the request arrived at the service.
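A minimal sketch of overload detection based on queue wait time, using a worker that drops jobs whose wait has already exceeded a threshold; the 20ms value echoes the WeChat figure discussed in the next section and is otherwise an arbitrary choice.

```go
package main

import (
	"fmt"
	"time"
)

type job struct {
	enqueuedAt time.Time
	work       func()
}

// worker drops any job whose queue wait time already exceeds maxWait,
// i.e. (time processing starts - time the request arrived) > maxWait.
func worker(queue <-chan job, maxWait time.Duration) {
	for j := range queue {
		wait := time.Since(j.enqueuedAt) // queue wait time
		if wait > maxWait {
			fmt.Println("overloaded: dropping request, waited", wait)
			continue
		}
		j.work()
	}
}

func main() {
	queue := make(chan job, 1024)
	go worker(queue, 20*time.Millisecond) // e.g. a 20ms threshold

	for i := 0; i < 5; i++ {
		n := i
		queue <- job{enqueuedAt: time.Now(), work: func() { fmt.Println("handled", n) }}
	}
	time.Sleep(100 * time.Millisecond)
	close(queue)
}
```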

Four, industry traffic limiting practices

4.1 WeChat overload control

[Overload Control for Scaling WeChat Microservices](https://arxiv.org/abs/1806.04075?from=timeline&isappinstalled=0)
[WeChat microservice overload control paper summary](https://bytedance.feishu.cn/docs/5FyZC9Ee0c34roxDey2Ybc)
This paper introduces the design of DAGOR, an overload control system for WeChat's account-oriented microservice architecture.
DAGOR has no knowledge of the internals of each service. It manages overload at the granularity of individual microservices, so that the load status of each microservice can be monitored in real time and, when overload is detected, load shedding can be triggered in a collaborative manner. DAGOR has been used in WeChat for five years. Experiments show that DAGOR improves the success rate of services under system overload and ensures the fairness of overload control.
Its basic mechanism: when a client request arrives, it is assigned a business priority and a user priority, and based on these priorities each of its downstream services may accept or reject the request. Each microservice has its own priority threshold for deciding whether to accept a request, and each monitors its own load status by checking system-level metrics such as the average queuing time of requests. If a microservice detects overload, it tries to reduce its load by adjusting the priority threshold (there is a tuning algorithm for this threshold). At the same time, it notifies its direct upstream services of the threshold change, so that client requests can be rejected at an earlier stage.
In short, DAGOR is a collaborative overload control system.

How to determine overload?

DAGOR uses queuing time to detect overload. WeChat's average queuing time threshold is 20ms: if the average queuing time exceeds 20ms, the service is considered overloaded.

Processing policy after overload

Priority policy:

When a service is overloaded, its load-shedding mechanism discards lower-priority requests first, freeing resources for higher-priority ones. The business priority of a user request (and of the subsequent series of service requests in its invocation chain) is determined by the type of action performed in the entry service.
Business priority + user priority (assigned by an automated algorithm).

The working process

1. When a user request arrives at the microservice system, it is routed to the relevant entry service. The entry service assigns business and user priorities to the request, and all subsequent requests to downstream services inherit the same priorities, encapsulated in the request header.
2. Each service invokes one or more downstream services according to its business logic. Service requests and responses are delivered via messaging.
3. When a service receives a request, it performs priority-based admission control according to its current admission level. The service periodically adjusts its admission level based on its load status. When a service wants to send subsequent requests to downstream services, it performs local admission control based on the stored admission levels of those downstream services: the upstream service only sends requests that local admission control allows (see the sketch after this list).
4. When a downstream service returns a response to the upstream service, it appends its current admission level to the response message.
5. When the upstream service receives the response, it extracts the admission level information from the message and updates its local record for that downstream service accordingly.
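To make step 3 concrete, here is a highly simplified sketch of priority-based admission control; it is not DAGOR's algorithm, and the priority convention, thresholds, and names are all assumptions.

```go
package main

import "fmt"

// Request carries the business and user priorities assigned by the entry service.
// Lower numbers mean higher priority here (an arbitrary convention).
type Request struct {
	BusinessPriority int
	UserPriority     int
}

// AdmissionController admits a request only if its priority is at or above
// the service's current admission level.
type AdmissionController struct {
	businessLevel int // requests with BusinessPriority > businessLevel are rejected
	userLevel     int // tie-break on user priority at the boundary business level
}

func (a *AdmissionController) Admit(r Request) bool {
	if r.BusinessPriority < a.businessLevel {
		return true
	}
	if r.BusinessPriority == a.businessLevel {
		return r.UserPriority <= a.userLevel
	}
	return false
}

// OnOverload tightens the admission level; OnRecover relaxes it.
func (a *AdmissionController) OnOverload() { a.businessLevel-- }
func (a *AdmissionController) OnRecover()  { a.businessLevel++ }

func main() {
	ac := &AdmissionController{businessLevel: 5, userLevel: 50}
	fmt.Println(ac.Admit(Request{BusinessPriority: 3, UserPriority: 10})) // true
	fmt.Println(ac.Admit(Request{BusinessPriority: 7, UserPriority: 10})) // false
	ac.OnOverload() // shed more load by lowering the admitted business level
	fmt.Println(ac.Admit(Request{BusinessPriority: 5, UserPriority: 10})) // false now
}
```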

Five, the summary

1. Traffic limiting is very important for the high availability of distributed systems. It protects the system and avoids avalanches in the case of sudden traffic or system anomalies. Common means include rate limiters, circuit breakers, and other overload detection and control mechanisms.
2. Common rate limiters are mainly implemented with static algorithms, including the Leaky Bucket and Token Bucket algorithms: simple, crude, and effective, but their parameters are hard-coded, which causes problems in certain circumstances. Dynamic algorithms can sense the pressure on the system and limit traffic automatically; a typical implementation can draw on the congestion control algorithms of TCP. Although a bit more complicated, they are effective and automatic, and rate highly overall as a traffic limiting approach.
3. The circuit breaker pattern is inspired by the electrical fuse and is usually implemented as a state machine. It protects the service itself when there are problems in the downstream call chain.
4. Besides the QPS-based detection of rate limiters and the failure-rate detection of circuit breakers, there are other approaches, such as judging by queuing time, CPU, memory, latency, and throughput; industry experience suggests that queuing time is the more reliable signal.
5. After flow control is triggered, common handling policies include denial of service and service degradation, which can distinguish KA (key account) users from ordinary users and apply different policies to each.