1. Introduction

What is service degradation? When server pressure increases dramatically, some services and pages are strategically skipped or handled in a simplified way, based on the actual business situation and traffic, in order to free server resources and ensure that core transactions keep running normally and efficiently.

If that still sounds abstract, here is an example. Suppose many people want to pay me right now, but my server is also running other services besides payment, such as search, scheduled tasks, and product details. These non-essential services consume a lot of JVM memory and CPU. Since collecting the money is the goal, I design a dynamic switch that shuts those non-essential services off at the outermost layer, so that the payment service at the back end has more resources and collects the money faster. That is a simple usage scenario for service degradation.

2. Application scenarios

What scenario is service degradation mainly used in? When the overall load of the microservice architecture exceeds the preset upper threshold or the incoming traffic is expected to exceed the preset threshold, we can delay or suspend the use of some non-important or non-urgent services or tasks to ensure the normal operation of important or basic services.

3. Core design

3.1 Distributed Switch

Based on the above requirements, we can set up a distributed switch for service degradation and then centrally manage the switch configuration information. The specific plan is as follows:

Service degradation – Distributed switch
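As a minimal sketch of the idea (all class and service names here are illustrative assumptions, not from the article), the distributed switch can be an in-memory map that the configuration center updates and that every request checks at the outermost layer:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Degradation switches for each service, updated by the configuration
// center and checked at the outermost layer before any real work is done.
public class DegradeSwitch {
    private static final Map<String, Boolean> SWITCHES = new ConcurrentHashMap<>();

    // Invoked when the configuration center pushes a switch change.
    public static void update(String service, boolean degraded) {
        SWITCHES.put(service, degraded);
    }

    // Services with no entry are treated as not degraded.
    public static boolean isDegraded(String service) {
        return SWITCHES.getOrDefault(service, false);
    }
}
```

A request handler would then call `DegradeSwitch.isDegraded("search")` first and return a degraded response immediately when the switch is on, instead of invoking the real service.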

3.2 Automatic Degradation

  • Timeout degradation – configure the timeout period, retry count and retry mechanism, and use an asynchronous mechanism to detect recovery

  • Failure degradation – mainly for unstable APIs: degrade automatically once the number of failed calls reaches a threshold, again using an asynchronous mechanism to detect recovery

  • Fault degradation – if the remote service to be invoked is down (network failure, DNS failure, an HTTP service returning an error status code, an RPC service throwing an exception), degrade directly

  • Rate-limit degradation – when the rate limit is exceeded, subsequent requests can be temporarily blocked or queued
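The failure-count bullet above can be sketched as a small counter, a simplified cousin of a circuit breaker (names are illustrative; the asynchronous recovery probe that would call `recordSuccess` is not shown):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Degrade a service automatically once consecutive failures reach a
// threshold; a recovery probe would later call recordSuccess to reset it.
public class FailureDegrader {
    private final int threshold;
    private final AtomicInteger failures = new AtomicInteger();
    private volatile boolean degraded = false;

    public FailureDegrader(int threshold) {
        this.threshold = threshold;
    }

    public void recordFailure() {
        if (failures.incrementAndGet() >= threshold) {
            degraded = true; // trip: stop calling the unstable API
        }
    }

    public void recordSuccess() {
        failures.set(0);     // recovery detected: reset and re-enable
        degraded = false;
    }

    public boolean isDegraded() {
        return degraded;
    }
}
```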

During a flash sale or a limited-purchase event, the system may crash under excessive traffic, so developers apply rate limiting. Once the limit threshold is reached, subsequent requests are degraded. The degraded response can be: a queueing page (redirect users to a queue and let them retry later), out of stock (tell users directly that the item is sold out), or an error page (the event is too popular, please try again later).
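A very small sketch of that rate-limit degradation (the limit policy and the "queue-page" response are illustrative assumptions; a real system might use a token bucket instead of an in-flight counter):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Once in-flight requests exceed the limit, later requests are degraded
// to a queueing page instead of being processed.
public class LimitDegrader {
    private final int limit;
    private final AtomicInteger inFlight = new AtomicInteger();

    public LimitDegrader(int limit) {
        this.limit = limit;
    }

    // Returns null to proceed normally, or the degraded page to serve.
    public String tryEnter() {
        if (inFlight.incrementAndGet() > limit) {
            inFlight.decrementAndGet(); // over the limit: roll back and degrade
            return "queue-page";
        }
        return null;
    }

    // Must be called when a normally processed request finishes.
    public void exit() {
        inFlight.decrementAndGet();
    }
}
```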

3.3 Configuration Center

Configuration for microservice degradation is centrally managed and operated through a friendly visual interface. Because the configuration center and the applications communicate over the network, a network failure or a restart can cause a configuration change to be lost, to arrive late, or never to be received at all. The degradation configuration center therefore needs the following features to ensure, as far as possible, that configuration changes take effect in time:

Service degradation – Configuration center

  • Active pull on startup – for the initial configuration (and to shorten the first timed-pull cycle)

  • Publish/subscribe – for timely delivery of configuration changes (covers roughly 90% of changes)

  • Timed pull – a safety net for when publish/subscribe fails or messages are lost (covers roughly the 9% of changes that publish/subscribe misses)

  • Offline file cache – a fallback for when the application cannot reach the configuration center after a restart

  • Editable configuration file – allows configuration to be defined by editing the file directly

  • Telnet command for changing configuration – for when the configuration center itself is down and configuration cannot be changed any other way
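A client-side sketch of how the first three features might fit together (a simplification under assumed names; the offline file cache and Telnet hooks are omitted, and the configuration is held as a plain string for brevity):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Combines active pull on startup, a push callback for publish/subscribe,
// and a timed pull as a safety net against lost push messages.
public class ConfigClient {
    private volatile String config;
    private final Supplier<String> remotePull; // a call to the config center

    public ConfigClient(Supplier<String> remotePull) {
        this.remotePull = remotePull;
        this.config = remotePull.get(); // 1. active pull on startup
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // do not keep the JVM alive just for polling
            return t;
        });
        // 3. timed pull, covering publish/subscribe failures
        timer.scheduleAtFixedRate(() -> config = this.remotePull.get(),
                30, 30, TimeUnit.SECONDS);
    }

    // 2. publish/subscribe: the config center invokes this on every change.
    public void onPush(String newConfig) {
        this.config = newConfig;
    }

    public String current() {
        return config;
    }
}
```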

3.4 Handling Policies

When new transactions arrive again after triggering service degradation, how do we handle these requests? From a microservices architecture global perspective, we usually have the following common degradation solutions:

  • Page degradation – disable button clicks in the UI and switch to static pages

  • Deferred services – e.g. delay scheduled tasks, or push messages into MQ for later processing

  • Write degradation – directly reject related write requests

  • Read degradation – directly reject related read requests

  • Cache degradation – serve frequently read service interfaces from cache instead

At the back-end code level, we usually degrade with one of the following measures:

  • Throw an exception

  • Return null

  • Return mock data

  • Call fallback processing logic
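The last measure can be sketched with a tiny helper (illustrative only; circuit-breaker frameworks offer this pattern as built-in fallback methods):

```java
import java.util.function.Supplier;

// Call the real service; on failure, fall back to a default value
// (mock data). Returning null or rethrowing are the other options.
public class FallbackCaller {
    public static <T> T callWithFallback(Supplier<T> service, T fallback) {
        try {
            return service.get();
        } catch (RuntimeException e) {
            return fallback; // degrade instead of propagating the failure
        }
    }
}
```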

4. Advanced features

We have created a degradation switch for each service, verified it online, and everything feels perfectly fine.

Scenario 1: One day the operations team runs an event and suddenly comes over: traffic has almost reached its limit, is there a way to degrade all the unimportant services in one batch? The developers are dumbfounded: short of editing the database directly, there is no batch operation.

Scenario 2: Another day, operations comes back with another request: we are holding an event later, so please degrade all the non-important services in advance. The developers are confused again: how are they supposed to know which services to degrade?

Reflection: the service degradation function is implemented, but the operational experience was never considered. With so many services, nobody knows which ones should be degraded, and degrading them one switch at a time is too slow.

4.1 Tier Degradation

When the microservice architecture runs into problems of different severity, we can choose which services to abandon based on their relative importance (the "sacrifice the rook to save the king" principle), so as to further guarantee the normal operation of core services.

If you wait until an online service is about to fail before deciding, service by service, which ones to degrade, then with hundreds of services online the system will fail before you finish. Sorting services out before each big promotion or flash sale is also a lot of work. It is therefore recommended that the architect or core developers evaluate in advance an initial default value for whether each service can be degraded.

To make batch degradation convenient in a microservice architecture, we can build a service-importance evaluation model from a global perspective. If possible, the Analytic Hierarchy Process (AHP), or another mathematical model, can be used for qualitative and quantitative evaluation. This is certainly far better than the architect deciding off the top of their head, but also much harder and more complex: it requires someone who can do mathematical modeling. The basic idea of AHP is to mirror the way people think through and judge a complex decision problem.

The following is the evaluation model I arrived at; it can serve as a reference model for service degradation evaluation:

Whether by mathematical modeling or by the architect deciding directly, combined with the priority principle of whether each service can be degraded, and taking typhoon warning levels (all of which are storm warnings) as a reference, we can classify all services in the microservice architecture into the following four fault storm levels:

Evaluation model:

  • Blue storm – a small number of non-core services need to be degraded

  • Yellow storm – a moderate number of non-core services need to be degraded

  • Orange storm – a large number of non-core services need to be degraded

  • Red storm – all non-core services must be degraded

Design description:

  • The fault severity is blue < yellow < orange < red

  • It is suggested that services can be divided into 80% non-core services and 20% core services according to the 80/20 principle

The above model evaluates degradation for the microservice architecture as a whole. For big promotions or flash sales, it is recommended to build models per theme (for events with different themes, degrading different services is more reasonable); the model can be the same, but the data should differ. Ideally, maintain a model library, so that at implementation time you only need to feed in the relevant services to get the final degradation plan, i.e. the list of services to degrade under a blue or yellow storm for this particular promotion or flash sale.
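The four-level model could drive the batch operation from Scenario 1 roughly like this (a sketch; the level-tagging convention and service names are illustrative assumptions):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Batch degradation by storm level. Each non-core service is tagged with
// the lowest alert level at which it should be degraded; raising the
// alert degrades every service tagged at or below that level.
public class StormDegrader {
    public enum Level { BLUE, YELLOW, ORANGE, RED }

    public static List<String> servicesToDegrade(Map<String, Level> serviceLevels, Level alert) {
        return serviceLevels.entrySet().stream()
                .filter(e -> e.getValue().ordinal() <= alert.ordinal())
                .map(Map.Entry::getKey)
                .sorted() // deterministic order for the batch operation
                .collect(Collectors.toList());
    }
}
```

The returned list would then be fed to the distributed switch in one batch instead of flipping switches one by one.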

4.2 Downgrade weights

In microservice architecture there is already the concept of service weight, mainly used for weighted selection in load balancing. Similarly, a service degradation weight can be used for fine-grained priority selection during degradation. Handling all services with the simple four-level division above is obviously too coarse-grained: for multiple services at the same level, in what order should they be degraded? And if we want degradation to be automated, even AI-driven, how much finer should the control be?

Based on these requirements, we can assign a degradation weight to each service to enable more intelligent service governance. The value can be assessed qualitatively and quantitatively with a mathematical model, or set directly by the architect based on experience.
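A minimal sketch of ordering by weight (the convention that a lower weight means less important, and therefore degraded first, is my assumption):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Each service carries a degradation weight; lower weight means less
// important, so it is degraded earlier. Sorting by weight yields the
// fine-grained degradation order within or across storm levels.
public class WeightedDegrade {
    public static List<String> degradeOrder(Map<String, Integer> weights) {
        return weights.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```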

5. Summary and outlook

The above offers a half-practical, half-theoretical set of service degradation schemes, and readers can choose appropriately according to their company's actual situation. I have not seen a complete implementation of the full scheme, but large companies with long-term service governance plans may find the complete scheme worth researching and implementing; it should have real governance value in a future of artificial intelligence and connected everything (personal opinion). For small companies, such a complex scheme is not recommended on cost and value grounds, but the distributed switch and simple tiered degradation are well within reach.

This article treats service degradation as the core of an ideal approach to governing a microservice architecture, and suggests using suitable mathematical models to achieve reasonable qualitative and quantitative analysis and governance of microservices, providing solution support for future AI governance of microservices (AIGMS).
