Preface

Station B can go down just like that!

Building a system that maintains high service quality under peak traffic is a real challenge. This article is compiled from a talk given at the "Yunjia Community Salon Online" by Mr. Mao Jian, technical director of Station B. It walks through systematic availability design based on Google SRE methodology and Station B's actual business response processes, and helps us see the overall picture of the system and how joint defense between upstream and downstream services contributes to it.

1. Load balancing

Load balancing falls into two areas: front-end load balancing and load balancing inside the data center.

For front-end load balancing, user traffic access is mainly based on DNS, with the goal of minimizing request latency. User traffic is optimally distributed across multiple network links, data centers and servers, and minimum latency is achieved through a dynamic CDN scheme.

In the figure above, user traffic first reaches the BFE front-end access layer. The first-layer BFE essentially acts as a router, choosing a machine room as close to the user's access node as possible to speed up requests. Traffic is then forwarded through the API gateway to the downstream service layer, which may consist of internal microservices or a business aggregation layer, and together this forms the complete traffic path.

Based on this, front-end load balancing mainly considers the following logic:

  • First, try to select the nearest node;

  • Second, select which machine room API traffic enters based on bandwidth-policy scheduling;

  • Third, balance traffic based on available service capacity.

For load balancing within a data center, the difference in CPU consumption between the busiest and least busy nodes should ideally be small, as shown on the right of the figure above. If load balancing is not done well, the picture can look like the one on the left instead. The consequence is that resources become hard to schedule and container resources cannot be allocated properly.

Therefore, load balancing within a data center mainly considers:

  • Balanced traffic distribution;

  • Reliably identify abnormal nodes;

  • Scale-out, add homogeneous nodes for expansion;

  • Reduce errors and improve availability.

When we expanded capacity by adding homogeneous nodes, we found that the CPU usage of internal services was too high. After investigation, we found that the cost of point-to-point health checks between RPC clients and servers was too high, which caused problems. On the other hand, if an underlying service has only one cluster, any jitter in that service causes a large number of failures, so multiple clusters are needed to solve the problem.

By having each client connect only to a subset of backends, backends can be distributed evenly across clients, and node changes can be handled smoothly so that connections stay balanced without large reshuffles. With multiple clusters, the O&M cost of cluster migration must be considered, and there is only a small overlap of service data between clusters.
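A minimal Go sketch of deterministic subsetting, in the spirit of the algorithm described in the Google SRE book, shows how each client can be assigned an even slice of backends. The function name, parameters and the per-round shuffle seed are illustrative assumptions, not Station B's actual implementation:

```go
package main

import (
	"fmt"
	"math/rand"
)

// deterministicSubset returns the subset of backends assigned to one client.
// backends: all backend addresses; clientID: index of this client;
// subsetSize: how many backends each client should connect to.
func deterministicSubset(backends []string, clientID, subsetSize int) []string {
	subsetCount := len(backends) / subsetSize

	// Group clients into rounds; each round reshuffles the backend list with a
	// deterministic seed, so every round spreads backends differently but evenly.
	round := clientID / subsetCount
	r := rand.New(rand.NewSource(int64(round)))
	shuffled := append([]string(nil), backends...)
	r.Shuffle(len(shuffled), func(i, j int) { shuffled[i], shuffled[j] = shuffled[j], shuffled[i] })

	// Pick this client's slice within the round.
	subsetID := clientID % subsetCount
	start := subsetID * subsetSize
	return shuffled[start : start+subsetSize]
}

func main() {
	backends := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5", "10.0.0.6"}
	fmt.Println(deterministicSubset(backends, 0, 3)) // client 0's subset
	fmt.Println(deterministicSubset(backends, 1, 3)) // client 1's subset (disjoint in the same round)
}
```

Because clients in the same round see the same shuffle, their subsets do not overlap, and adding or removing a backend only perturbs one round's assignment rather than reshuffling every connection.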

Coming back to the problem of the large CPU gap between busy and idle nodes, we found that the load balancing algorithm was the root cause.

The first problem is that the cost of each request is different: each query and each API call costs a different amount. The differences are so large that even if traffic is distributed evenly by request count, the load is not actually even.

The second problem is that physical machine environments differ. Because we usually purchase servers year by year, newly purchased servers tend to have CPUs with higher clock speeds, so true homogeneity is hard to achieve.

Based on this, and looking at the problems caused by the JSQ (join the shortest queue, i.e. pick the least busy node) load balancing algorithm, we found that it lacks a global view across servers, so our goal had to take both load and availability into consideration. Drawing on the idea of "The Power of Two Choices in Randomized Load Balancing", we adopted the choice-of-2 algorithm: score two randomly selected nodes and pick the better one (a sketch follows the list below):

  • On the backend side, CPU is used as the indicator; on the client side, health, in-flight requests and latency are used. The node is scored with a simple linear formula;

  • Warm up newly started nodes using a penalty value and probing, to minimize the impact of sending them full traffic immediately;

  • To avoid permanently blacklisting a node with a low score, its indicators gradually decay back to their initial (default) values through statistical attenuation.
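A minimal Go sketch of the choice-of-2 selection follows. The indicator fields and the linear scoring formula are illustrative assumptions; the real algorithm also applies the warm-up and decay logic described above:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
	"time"
)

// node holds the per-backend statistics used for scoring.
type node struct {
	addr     string
	cpu      int64 // server-reported CPU (0-1000), piggybacked on responses
	inflight int64 // requests currently in flight to this node
	latency  int64 // smoothed latency in microseconds
	health   int64 // success rate scaled to 0-100
}

// score combines the signals linearly; lower is better.
func (n *node) score() int64 {
	return n.cpu * atomic.LoadInt64(&n.inflight) * n.latency / (n.health + 1)
}

// pick implements the power of two choices: sample two random nodes
// and take the one with the better (lower) score.
func pick(nodes []*node, r *rand.Rand) *node {
	a := nodes[r.Intn(len(nodes))]
	b := nodes[r.Intn(len(nodes))]
	if a.score() <= b.score() {
		return a
	}
	return b
}

func main() {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	nodes := []*node{
		{addr: "10.0.0.1", cpu: 300, latency: 800, health: 99, inflight: 10},
		{addr: "10.0.0.2", cpu: 700, latency: 1500, health: 95, inflight: 40},
		{addr: "10.0.0.3", cpu: 450, latency: 900, health: 98, inflight: 20},
	}
	chosen := pick(nodes, r)
	atomic.AddInt64(&chosen.inflight, 1) // account for the new request
	fmt.Println("picked:", chosen.addr)
}
```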

After optimizing the load balancing algorithm, we saw significant gains.

2. Rate limiting

Avoiding overload is an important goal of load balancing. As pressure increases, no matter how efficient the load balancing strategy is, some part of the system will eventually become overloaded. We prioritize graceful degradation: returning lower-quality results and providing a degraded service. In the worst case, proper rate limiting ensures that the service itself remains stable.

For rate limiting, we think the following points deserve attention:

  • First, limiting by QPS is problematic, because requests have different costs and static thresholds are hard to configure;

  • Second, discard requests by priority according to the importance of the API;

  • Third, set limits per user; when global overload occurs, it is critical to contain the few "abnormal" callers;

  • Fourth, rejecting requests also has a cost;

  • Fifth, configuring rate limits for every service brings operation and maintenance costs.

For the rate limiting strategy, we first adopt distributed rate limiting. We implemented a quota-server that lets the backend control each client; that is, the backend requests quotas from the quota-server.

The advantage is that quota is fetched once and then consumed locally, which reduces the number of requests to the quota-server. At the algorithm level, max-min fairness is used to prevent a single heavy consumer from starving the others.
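As a rough illustration of max-min fairness, the following Go sketch divides a fixed quota among clients with different demands; the function and client names are hypothetical and not the quota-server's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// maxMinFair divides a total quota among clients according to their demands so
// that a heavy consumer cannot starve the others: small demands are satisfied
// in full first, then the remaining capacity is shared evenly among the rest.
func maxMinFair(total float64, demands map[string]float64) map[string]float64 {
	type entry struct {
		name   string
		demand float64
	}
	entries := make([]entry, 0, len(demands))
	for name, d := range demands {
		entries = append(entries, entry{name, d})
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].demand < entries[j].demand })

	alloc := make(map[string]float64, len(entries))
	remaining := total
	for i, e := range entries {
		share := remaining / float64(len(entries)-i) // equal share of what is left
		if e.demand < share {
			share = e.demand // small demands are fully satisfied
		}
		alloc[e.name] = share
		remaining -= share
	}
	return alloc
}

func main() {
	// Three clients ask for 20, 60 and 200 QPS, but the backend can only serve 120:
	// the result is 20, 50 and 50.
	fmt.Println(maxMinFair(120, map[string]float64{"app-a": 20, "app-b": 60, "app-c": 200}))
}
```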

On the client side, when a user exceeds the resource quota, the backend quickly rejects the request and returns a "quota exhausted" error. But the backend can then end up busy doing nothing except rejecting requests, which still leads to overload and a flood of dependency errors. To protect the downstream in both situations, we choose to throttle directly on the client side, so that excess requests never even reach the network layer.

The Google SRE book gives an interesting formula for this: max(0, (requests - K * accepts) / (requests + 1)). With this formula, the client normally sends requests straight through; once the threshold is exceeded, requests are rejected locally with the computed probability.
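A minimal Go sketch of this client-side adaptive throttling formula, assuming requests and accepts are tracked over a sliding window (plain counters are used here for brevity):

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// clientThrottle applies the adaptive throttling formula from the Google SRE book:
// rejectProbability = max(0, (requests - K*accepts) / (requests + 1)).
type clientThrottle struct {
	k        float64 // K, typically around 2
	requests float64 // requests the client has attempted
	accepts  float64 // requests the backend actually accepted
}

// allow decides locally whether to send the request at all;
// rejected requests never reach the network layer.
func (t *clientThrottle) allow() bool {
	p := math.Max(0, (t.requests-t.k*t.accepts)/(t.requests+1))
	return rand.Float64() >= p
}

// record updates the counters after each attempt.
func (t *clientThrottle) record(accepted bool) {
	t.requests++
	if accepted {
		t.accepts++
	}
}

func main() {
	t := &clientThrottle{k: 2}
	// Simulate a backend that rejects most traffic; the local reject probability rises.
	for i := 0; i < 1000; i++ {
		t.record(i%10 == 0) // only 10% of requests are accepted
	}
	fmt.Printf("current local reject probability: %.2f\n",
		math.Max(0, (t.requests-t.k*t.accepts)/(t.requests+1)))
	fmt.Println("send this request?", t.allow())
}
```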

For overload protection, the core idea is to discard a certain amount of traffic when the service is overloaded, so that the system keeps serving close to its peak throughput while protecting itself. Common practices include discarding traffic based on CPU and memory usage, managing requests with queues, and using controlled-delay algorithms such as CoDel.

In simple terms, when CPU reaches 80% we can consider the service close to overload. If the throughput is 100 and the instantaneous request rate is 110, we can drop those extra 10 requests so that the service protects itself. Based on this idea, we eventually implemented an overload protection algorithm.

We use the sliding mean of CPU usage (CPU > 800 on a 0-1000 scale) as the heuristic threshold; once it is exceeded, we enter the overload protection phase. The shedding condition is (MaxPass * AvgRT) < InFlight, where MaxPass and AvgRT are statistics from the sliding time window before triggering.

Once rate limiting kicks in, the CPU jitters around the critical value (800). Without a cool-down period, a brief CPU drop can let a large number of requests through and even push the CPU back to 100%. So after the cool-down period ends, we check the CPU again to decide whether to stay in overload protection.
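The following Go sketch shows only the decision logic described above (the CPU threshold, the MaxPass * AvgRT bound on in-flight requests, and a cool-down period); the sliding-window statistics are assumed to be maintained elsewhere, and the struct and field names are illustrative, not the production implementation:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// overloadGuard sheds load once CPU exceeds 800 (80% on a 0-1000 scale)
// and keeps shedding during a cool-down period even if CPU briefly dips.
type overloadGuard struct {
	cpu         int64 // sliding mean of CPU, scale 0-1000
	inFlight    int64 // requests currently being processed
	maxPass     int64 // max passed requests per window bucket before triggering
	avgRT       int64 // average response time (in buckets) before triggering
	coolDownEnd int64 // unix nanoseconds until which the guard stays engaged
}

var errOverload = errors.New("rejected by overload protection")

func (g *overloadGuard) Allow() error {
	now := time.Now().UnixNano()
	coolingDown := now < atomic.LoadInt64(&g.coolDownEnd)
	if atomic.LoadInt64(&g.cpu) < 800 && !coolingDown {
		atomic.AddInt64(&g.inFlight, 1)
		return nil // not overloaded: let the request through
	}
	// Overloaded (or still cooling down): admit only up to MaxPass*AvgRT in-flight requests.
	if atomic.LoadInt64(&g.inFlight) >= g.maxPass*g.avgRT {
		atomic.StoreInt64(&g.coolDownEnd, now+int64(time.Second)) // extend the cool-down
		return errOverload
	}
	atomic.AddInt64(&g.inFlight, 1)
	return nil
}

// Done must be called when a request finishes.
func (g *overloadGuard) Done() { atomic.AddInt64(&g.inFlight, -1) }

func main() {
	g := &overloadGuard{cpu: 850, maxPass: 100, avgRT: 2, inFlight: 230}
	if err := g.Allow(); err != nil {
		fmt.Println(err) // 230 >= 200, so this request is shed
	}
}
```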

3. Retries

Traffic typically flows from BFE to SLB, then through the API gateway to the BFF, and from microservices down to the database, so there are many layers in the path. In day-to-day operation, what should we do when a request returns an error and part of the backend is overloaded?

  • First, we need to limit the number of retries and use a strategy based on the retry distribution (e.g. a retry ratio);

  • Second, we should retry only at the layer where the failure occurred; when retries still fail, a globally agreed error code is needed so that upper layers do not retry again, avoiding cascading retries;

  • In addition, we need to use randomized, exponentially growing retry intervals, i.e. Exponential Backoff with Jitter (see the sketch after this list);

  • Finally, we need to set the retry rate metric for fault diagnosis.
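A minimal Go sketch of retries with exponential backoff and full jitter, honouring the caller's deadline; the retry count and delay values are illustrative defaults, not Station B's settings:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries fn with "full jitter" exponential backoff: each wait
// is a random duration in [0, base*2^attempt), capped at maxDelay.
func retryWithJitter(ctx context.Context, maxRetries int, base, maxDelay time.Duration, fn func() error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt == maxRetries {
			break
		}
		backoff := base << uint(attempt)
		if backoff > maxDelay {
			backoff = maxDelay
		}
		sleep := time.Duration(rand.Int63n(int64(backoff))) // full jitter
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err() // stop retrying once the caller's deadline is hit
		}
	}
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	calls := 0
	err := retryWithJitter(ctx, 3, 50*time.Millisecond, 500*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("temporary failure")
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```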

Rate limiting is also needed on the client side, because users tend to retry an unreachable service over and over. The client can limit its request frequency based on the interface-level error_details attached to the response of each API.

4. Timeouts

As mentioned earlier, most failures are caused by improper timeout control. The most common case is a high-latency service under high concurrency: requests pile up on the client, threads block, upstream traffic keeps pouring in, and eventually everything fails. Essentially, a timeout is a fail-fast strategy: we consume requests as early as we can, and requests that have piled up past their deadline should be discarded rather than processed.

On the other hand, when an upstream service has already timed out and returned an error to the user, the request may still be executing downstream, which wastes resources.

Another question: when we call downstream services, how exactly should timeouts be configured, and what should the default-value policy be? In production, a slip of the hand or an incorrect configuration can cause outages, so it is best to do some defensive programming at the framework level and clamp values into a reasonable range.

The key to in-process timeout control is to check, before each stage (network request) begins, whether there is enough time left to process it. There may also be some local computation in the process, which we generally consider cheap enough that we do not control it.
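A minimal Go sketch of this in-process check, assuming the request's time budget is carried by a context deadline; the per-stage cost estimates are made-up numbers for illustration:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errDeadlineTooShort = errors.New("not enough time budget left, fail fast")

// ensureBudget checks, before a stage starts, that the remaining time in the
// context is at least what the stage is expected to need; otherwise it fails
// fast instead of starting work that cannot finish in time.
func ensureBudget(ctx context.Context, need time.Duration) error {
	deadline, ok := ctx.Deadline()
	if !ok {
		return nil // no deadline set; a framework default would apply here
	}
	if time.Until(deadline) < need {
		return errDeadlineTooShort
	}
	return nil
}

func handleRequest(ctx context.Context) error {
	// Stage 1: call a downstream service expected to take ~100ms.
	if err := ensureBudget(ctx, 100*time.Millisecond); err != nil {
		return err
	}
	// ... RPC to service A ...

	// Stage 2: query the database, expected ~50ms.
	if err := ensureBudget(ctx, 50*time.Millisecond); err != nil {
		return err
	}
	// ... database query ...
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Millisecond)
	defer cancel()
	fmt.Println(handleRequest(ctx)) // budget already too small: fail fast
}
```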

Many RPC frameworks now do cross-process timeout control. Why? Cross-process timeout control follows the same idea as in-process control: the remaining time budget is passed to the downstream service as RPC metadata, and each hop keeps subtracting from and forwarding the quota, so that the whole upstream-downstream chain stays within, say, one second.
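A minimal Go sketch of passing the remaining time budget across processes; the header name and helper functions are hypothetical (frameworks such as gRPC propagate deadlines for you, but a home-grown RPC layer can pass the quota explicitly as metadata):

```go
package main

import (
	"context"
	"fmt"
	"strconv"
	"time"
)

// timeoutHeader is an assumed metadata key carrying the remaining budget in milliseconds.
const timeoutHeader = "x-timeout-ms"

// outgoingTimeout is what the caller writes into the RPC metadata: the smaller
// of its remaining budget and its own configured client timeout.
func outgoingTimeout(ctx context.Context, configured time.Duration) time.Duration {
	if deadline, ok := ctx.Deadline(); ok {
		if remaining := time.Until(deadline); remaining < configured {
			return remaining
		}
	}
	return configured
}

// serverContext is what the callee does on receipt: rebuild a context whose
// deadline honours the quota passed by the caller.
func serverContext(parent context.Context, headers map[string]string) (context.Context, context.CancelFunc) {
	ms, err := strconv.Atoi(headers[timeoutHeader])
	if err != nil || ms <= 0 {
		return context.WithTimeout(parent, time.Second) // fall back to a default
	}
	return context.WithTimeout(parent, time.Duration(ms)*time.Millisecond)
}

func main() {
	// The caller has ~800ms left overall but would normally use a 500ms client timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 800*time.Millisecond)
	defer cancel()
	t := outgoingTimeout(ctx, 500*time.Millisecond)
	headers := map[string]string{timeoutHeader: strconv.Itoa(int(t / time.Millisecond))}

	// The callee reconstructs its own deadline from the header.
	srvCtx, srvCancel := serverContext(context.Background(), headers)
	defer srvCancel()
	deadline, _ := srvCtx.Deadline()
	fmt.Println("downstream budget:", time.Until(deadline).Round(10*time.Millisecond))
}
```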

5. Handling cascading failures

Combining the four aspects discussed above, here are the key points to consider when dealing with cascading failures.

First, we need to avoid overload as much as possible. When nodes fail one after another, the service eventually avalanches and the whole cluster may go down, which is why we emphasized self-protection.

Second, we apply rate limiting. It lets a client control the volume of concurrent requests to a service, so the service is harder to kill. In addition, when we cannot provide normal service, we can serve in a degraded mode, sacrificing some non-core functions to protect key services and thus degrade gracefully.

Third, for the retry strategy, back off as much as possible between microservices and always consider how retries multiply traffic onto downstream services. In addition, when mobile users cannot use a feature, they usually refresh the page repeatedly, which creates a traffic surge, so we also need the mobile clients to cooperate on flow control.

Fourth, timeout control emphasizes two things: timeouts within a process and timeout propagation across processes. Ultimately the timeout for the whole chain is determined by the topmost node, and as long as this is in place, I think a cascading failure is highly unlikely.

Fifth, change management. Most incidents follow some change or release, so change management needs to be strengthened. Destructive behavior during changes should carry consequences; the point is not to blame individuals, but the consequences are still needed to make people pay attention.

Sixth, stress testing to the limit and failure drills. When doing a stress test, people often stop as soon as errors start to appear. I suggest continuing to push load even after errors occur, to see how the service actually behaves: can it still provide service under overload? With an overload protection algorithm in place, keep pushing, reject actively, and combine that with circuit breaking to form layered protection. Frequent failure drills produce a quality control manual that everyone can learn from, so that people are not easily flustered and can resolve problems quickly when they do occur in production.

Seventh, consider capacity expansion, restart, and elimination of harmful traffic.

The reference shown above is a classic complement to these strategies and something of a cure-all for solving all kinds of service problems.

6. Q&A

Q: What metrics are used for load balancing?

A: On the server side, mainly CPU; I think CPU is the best signal. From the client's perspective, we use health, which is the success rate of the connection, and latency is also an important metric. We also need to consider how many in-flight requests each client has sent to each backend.

Q: Does BFE connect to SLB over the public network or over leased lines?

A: Both, actually: public network and leased lines.

Q: If there are clients on the order of thousands, each doing a ping-pong health check every 10 seconds, that is literally hundreds of QPS. Isn't the CPU overhead high?

A: A single service may have several thousand clients, but when all your upstream services are added together, the number of clients can be very large, probably over ten thousand. So the CPU overhead can indeed be quite high, because so many different applications are running HealthCheck against you.

Q: What are the costs of multiple clusters?

A: The multi-cluster setup mentioned above mostly means deploying multiple sets of machines in the same machine room, so resources are certainly redundant and doubled. There is a cost to this, so we do not do this redundancy for all services, only for core services. Essentially it is spending some money on redundancy to improve availability as much as possible, because when a lower-level service fails, the impact really spreads widely.

Q: Isn't timeout propagation too strict? Won't the whole request fail if one node is slow?

A: This strategy is called timeout propagation, and we enable it by default. In some cases, even though the timeout is propagated, the behavior can be overridden by the context, so it ultimately depends on the logic of your code.

Q: How do you balance quality and capacity when choosing a user's access node?

A: It depends on the scheduling strategy. Generally speaking, it depends first on the purpose of the service: if it is user-facing, I think quality is the priority. Second, as long as the target machine room is not overloaded, deliver traffic to the nearest node. In the extreme case where that machine room is overloaded, you have to forward traffic through the access nodes to other core machine rooms instead.

Original text: cloud.tencent.com/developer/a…