
A true master always has the heart of an apprentice

Introduction

In a microservice architecture, the more services a system is split into, the more places there are where something can fail, so it is worth thinking about how to keep the architecture robust. Fault tolerance is one such stability measure: it acts as the fuse of the microservice architecture and forms an effective protection mechanism for the business platform. When the platform behaves abnormally, the fault tolerance mechanism is the last line of defense for keeping it running stably.

Why does a microservice architecture need fault tolerance?

Some readers may remember this from years ago: when I was a child, the voltage at home was often unstable, the lights flickered, and sometimes a short circuit cut the power altogether. Whenever the house went pitch black, the adults would usually say, "Ah, the fuse has probably blown again." The fuse protects the household circuit: when a short circuit occurs, the fuse melts and breaks the circuit, protecting the appliances in the home from current overload.

The circuit breaker in a microservice architecture is the basic component that plays the role of that fuse. When we design an architecture, we need not only to meet the business requirements but also to design for failure: when external conditions change or internal anomalies occur, the architecture should keep the impact of those anomalies as small as possible. Strong fault tolerance is a key indicator of a good architecture.

Coming back to today's topic, we can think in reverse: if the system had no fault tolerance mechanism such as circuit breaking and degradation, what would happen to the whole platform under abnormal conditions? Let's first look at the following scenarios.

Scenario 1: An abnormal service node affects its upstream callers

Suppose a Client service has to call an interface of the Service1 cluster, an interface of the Service2 cluster, and an interface of the Service3 cluster to complete one business process. Now suppose the Service3 cluster becomes abnormal: its instances are still up, but because of full GC, slow queries, or service exceptions, calls from the Client to Service3 time out, and the Client cannot respond within the specified time. As requests keep arriving, more and more of the Client's worker threads are blocked waiting on the timed-out calls, until the Client itself becomes unavailable.

Once the Client is unavailable, its own upstream callers may be affected in the same way, and the failure spreads upward layer by layer like a virus. The impact keeps being amplified until the whole platform becomes unavailable.
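The thread exhaustion in Scenario 1 can be estimated on the back of an envelope with Little's law (threads in use ≈ arrival rate × time each request holds a thread). The pool size, request rate, and latencies below are made-up illustrative numbers, not measurements from any real system.

```python
def busy_threads(requests_per_second: float, seconds_per_request: float) -> float:
    """Little's law: average number of threads occupied at steady state."""
    return requests_per_second * seconds_per_request


POOL_SIZE = 200   # hypothetical size of the Client's worker pool
RATE = 500.0      # hypothetical requests per second reaching the Client

healthy = busy_threads(RATE, 0.05)   # assumed: Service3 answers in 50 ms
degraded = busy_threads(RATE, 2.0)   # assumed: every call waits out a 2 s timeout
pool_exhausted = degraded > POOL_SIZE

print(f"healthy:  about {healthy:.0f} of {POOL_SIZE} threads busy")
print(f"degraded: about {degraded:.0f} threads needed, pool has {POOL_SIZE}")
print("pool exhausted:", pool_exhausted)
```

Under these assumptions the same request rate that used a small fraction of the pool suddenly needs several times more threads than exist, which is why one slow dependency is enough to make the caller unavailable.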

Scenario 2: A traffic surge makes a service unavailable and affects the other services that depend on it

Assume the Service1 cluster can carry 6,000 QPS. Normally, the combined traffic of its three upstream services stays below this threshold. During a traffic surge, however, one upstream service alone sends more than 10,000 QPS, which exceeds the capacity of the Service1 cluster, and the cluster starts responding abnormally. The other services that depend on Service1, here the Service2 and Service3 clusters, then find their threads blocked as well, and the entire platform ends up in an abnormal state.

Based on the analysis above, circuit breaker and degradation components are introduced into a microservice architecture to improve its overall fault tolerance and to avoid the following three situations that threaten platform stability:

1. A single node in a service cluster fails, and the impact is amplified without limit toward the upstream services;

2. Tenants that share common infrastructure services affect one another when those basic services become abnormal;

3. The traffic of one service surges suddenly, a service cluster cannot sustain it, and the stability of the entire platform suffers.

How to solve it

Resource isolation

Let's take a real ship's hull as an example. The bottom of the hull is not one completely hollow space; it is divided into compartments by bulkheads. Why? So that if the hull springs a leak, only the affected compartment floods, rather than the whole ship filling with water.

With this idea of compartment isolation in mind, can we use resource isolation to protect a microservice architecture in the programming world? The answer is yes.

1. Thread pool isolation

Resource isolation can be implemented with thread pool isolation: each kind of request is handled by its own thread pool. Even if calls to one resource time out, at most the threads of that pool are tied up; the thread resources of the rest of the service are unaffected, much like a sealed compartment in a ship's hull.
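A minimal sketch of thread pool isolation, assuming one small dedicated pool per downstream dependency. The pool sizes, timeouts, and the functions `call_service1`, `call_service3`, and `call_isolated` are hypothetical names chosen for illustration; Service3 is simulated as slow with a sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# One dedicated pool per dependency (sizes are illustrative assumptions).
POOLS = {
    "service1": ThreadPoolExecutor(max_workers=2, thread_name_prefix="service1"),
    "service3": ThreadPoolExecutor(max_workers=2, thread_name_prefix="service3"),
}


def call_service1() -> str:
    return "service1 ok"


def call_service3() -> str:
    time.sleep(0.5)  # simulate a slow, unhealthy dependency
    return "service3 ok"


def call_isolated(name: str, fn, timeout: float) -> str:
    """Run fn on the pool reserved for `name` and give up after `timeout` seconds."""
    future = POOLS[name].submit(fn)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        future.cancel()  # only has an effect if the task is still queued
        return f"{name} timed out"


# Four calls to the slow dependency tie up (and overflow) its own pool only.
slow_results = [call_isolated("service3", call_service3, timeout=0.05) for _ in range(4)]
# The healthy dependency still has free threads in its separate pool.
fast_result = call_isolated("service1", call_service1, timeout=1.0)

print(slow_results)
print(fast_result)
```

The price of this design is extra threads and context switches per dependency, which is the usual reason to consider the lighter semaphore approach described next.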

2. Semaphore isolation

A semaphore limits the number of threads that may call a resource at the same time, so the maximum number of concurrent calls can be specified. Requests beyond that limit are discarded or delayed, which prevents service exceptions caused by an ever-growing number of threads.
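A minimal sketch of semaphore isolation using Python's standard `threading.BoundedSemaphore`. The class name `SemaphoreIsolation` and the limit of 2 are assumptions for illustration. This version discards excess requests; delaying them instead would mean calling `acquire` with a timeout rather than non-blocking.

```python
import threading


class SemaphoreIsolation:
    """Allow at most `max_concurrent` in-flight calls and reject the rest."""

    def __init__(self, max_concurrent: int) -> None:
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args):
        # Non-blocking acquire: no free permit means the request is discarded.
        if not self._permits.acquire(blocking=False):
            raise RuntimeError("rejected: concurrency limit reached")
        try:
            return fn(*args)
        finally:
            self._permits.release()


guard = SemaphoreIsolation(max_concurrent=2)
release = threading.Event()
started = threading.Semaphore(0)


def slow_call() -> str:
    started.release()  # signal that this call now holds a permit
    release.wait()     # stay in flight until the demo lets go
    return "done"


holders = [threading.Thread(target=guard.call, args=(slow_call,)) for _ in range(2)]
for t in holders:
    t.start()
started.acquire()
started.acquire()      # both permits are now taken

try:
    guard.call(lambda: "third call")
    outcome = "accepted"
except RuntimeError as exc:
    outcome = str(exc)

release.set()
for t in holders:
    t.join()
after = guard.call(lambda: "accepted after release")

print(outcome)
print(after)
```

Unlike thread pool isolation, the call runs on the caller's own thread, so this approach limits concurrency but cannot by itself cut off a call that is already hanging.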

Circuit breaking

As mentioned above, a circuit breaker works like a fuse: when traffic is too heavy or the request error rate is too high, the breaker trips, the corresponding business link is cut off, and it stops serving requests. When traffic returns to normal or the failures subside, the breaker closes again and the link resumes service. This is an effective way to protect back-end microservices. During a big sales promotion, for example, the platform needs enough machines to keep the core purchase links running normally; refund and return services can be temporarily cut off and restored once the promotion is over.

The core of the circuit breaking mechanism is the design of the circuit breaker itself, which has two main aspects: the design of its state transitions, and how it decides to trip by comparing collected statistics against a threshold.
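The two aspects can be sketched together as follows. This assumes the commonly used three-state design (closed, open, half-open), which the text above does not spell out, and it uses simple cumulative counters instead of a sliding window. The class name, parameters, and thresholds are illustrative assumptions, not the design of any particular library.

```python
import time


class CircuitBreaker:
    """CLOSED -> OPEN when failures cross a threshold; OPEN -> HALF_OPEN after a
    cool-down; HALF_OPEN -> CLOSED or OPEN depending on one trial call."""

    def __init__(self, failure_ratio=0.5, min_calls=4, open_seconds=5.0,
                 clock=time.monotonic):
        self.failure_ratio = failure_ratio  # trip when failures/calls reaches this
        self.min_calls = min_calls          # do not judge on too few samples
        self.open_seconds = open_seconds    # cool-down before a trial call
        self.clock = clock
        self.state = "CLOSED"
        self.calls = 0
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_seconds:
                raise RuntimeError("circuit open: call rejected")
            self.state = "HALF_OPEN"        # cool-down over: allow one trial call
        try:
            result = fn()
        except Exception:
            self._record(ok=False)
            raise
        self._record(ok=True)
        return result

    def _record(self, ok: bool) -> None:
        if self.state == "HALF_OPEN":
            # The trial call alone decides whether to close or re-open.
            if ok:
                self.state, self.calls, self.failures = "CLOSED", 0, 0
            else:
                self._trip()
            return
        self.calls += 1
        if not ok:
            self.failures += 1
        if (self.calls >= self.min_calls
                and self.failures / self.calls >= self.failure_ratio):
            self._trip()

    def _trip(self) -> None:
        self.state = "OPEN"
        self.opened_at = self.clock()


now = [0.0]  # fake clock so the demo does not need to sleep
breaker = CircuitBreaker(clock=lambda: now[0])


def failing():
    raise ConnectionError("Service3 unavailable")


for _ in range(4):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
state_after_failures = breaker.state

try:
    breaker.call(lambda: "ok")
    rejected = False
except RuntimeError:
    rejected = True

now[0] = 6.0  # cool-down has elapsed
trial = breaker.call(lambda: "ok")
print(state_after_failures, rejected, trial, breaker.state)
```

While the breaker is open, callers fail immediately instead of blocking a worker thread on a timeout, which is what stops the upward spread described in Scenario 1.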

Degradation

When traffic is too heavy and system resources are limited, the platform cannot handle every request. At that point, less important function modules can be degraded, that is, they stop serving external requests, which frees up resources for the core functions. On a product detail page, for example, the product list served by the product service is the most important content, whereas the user's points and profile picture are not core business. With limited system capacity, the product service is therefore given priority and the other services are degraded.
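A minimal sketch of degradation for the product detail page example, assuming a simple switch that names the modules currently turned off. All function names, the fallback values, and the `degraded` set are hypothetical; a real system would typically drive the switch from a configuration center or load metrics.

```python
def product_list() -> list:
    return ["item-1", "item-2"]  # core service: always called


def user_points() -> int:
    return 1200                  # non-core service


def user_avatar() -> str:
    return "user-avatar.png"     # non-core service


def load(name: str, fetch, fallback, degraded: set):
    """Call the real service, or return a cheap fallback if the module is degraded."""
    if name in degraded:
        return fallback          # the downstream service is not called at all
    return fetch()


def render_detail_page(degraded: set) -> dict:
    return {
        "products": product_list(),
        "points": load("points", user_points, None, degraded),
        "avatar": load("avatar", user_avatar, "default-avatar.png", degraded),
    }


normal_page = render_detail_page(degraded=set())
busy_page = render_detail_page(degraded={"points", "avatar"})

print(normal_page)
print(busy_page)
```

The page still renders under load; it simply shows placeholders for the non-core parts while the capacity they would have used goes to the product list.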

Conclusion

This article analyzed the fault tolerance mechanisms of a microservice architecture, from why they are needed to how resource isolation, circuit breaking, and degradation protect microservices. Stay tuned for the next article, which will show how Hystrix, a circuit breaking and degradation component, works.

I’m Mu Feng, thank you for your likes, favorites and comments. I’ll keep you posted. See you next time!
