At eleven o'clock on the night of Double Eleven, I came home from work and found that my girlfriend, who usually goes to sleep very early, was still awake. So I asked her:


Service degradation: when server load rises sharply, some services and pages are deliberately left unprocessed or handled in a simplified way, depending on the actual business situation and traffic, so that server resources are freed up to keep core transactions running normally or efficiently.

Service degradation

The concept of service degradation may not sound easy to understand, but consider a real-life example.

Sometimes when we go to a restaurant for dinner, the waiter brings a questionnaire and asks the diners to give feedback. But the request for feedback only comes up when the restaurant is not busy. If it is very busy and full of customers, the staff stop asking diners to fill out questionnaires.

In fact, this is service degradation. During busy periods, the customer feedback feature is downgraded, because it is not that important.

Now let's look at degradation in distributed systems.

The picture above shows the detail page of a Taobao product, which will be all too familiar to many buyers.

By my rough count, this page has at least 15 functional modules, such as pictures, title, pricing, inventory, recommendations, reviews, logistics, favorites, ordering and so on.

All of these features are on the same page, but not all of them live in the same application. These modules may be implemented separately in a dozen or more applications.

When the detail page is rendered, it interacts with all of these applications over the network.

Some of these functions are very important, such as pricing, inventory, ordering, etc. Others are less important, such as recommendations, favorites, etc.

The process of identifying which functions are core and which are non-core, and then choosing an appropriate way to degrade the non-core ones, is called drawing up a degradation plan.

On the day of Double Eleven, traffic across the whole website is enormous, and the product detail page is the hardest-hit area of the entire site. Therefore, during a big promotion, the main features need to be rate limited, while the secondary features can be degraded, that is, certain modules are not shown, or some default content is returned.
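
As a minimal sketch of "not showing certain modules, or returning some default content", the snippet below assembles a detail page from per-module loaders. The module names, the defaults, and the `render_detail_page` function are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict

# Hypothetical classification of modules; a real plan would come from the business.
CORE_MODULES = {"pricing", "inventory", "order"}
# Hypothetical default content for some non-core modules.
DEFAULTS = {"recommendation": [], "favorites": {"count": None}}


def render_detail_page(loaders: Dict[str, Callable[[], object]],
                       degraded: bool) -> Dict[str, object]:
    """Assemble the page; while degraded, non-core modules are never called."""
    page = {}
    for name, load in loaders.items():
        if degraded and name not in CORE_MODULES:
            if name in DEFAULTS:
                page[name] = DEFAULTS[name]  # return default content
            # non-core modules without a default are simply not shown
            continue
        page[name] = load()
    return page


loaders = {
    "pricing": lambda: 99.0,
    "inventory": lambda: 12,
    "recommendation": lambda: ["item-1", "item-2"],
    "logistics": lambda: "ships in 24h",
}
normal_page = render_detail_page(loaders, degraded=False)
degraded_page = render_detail_page(loaders, degraded=True)
```

In the degraded page, pricing and inventory are still loaded, the recommendation module falls back to an empty list, and the logistics module is left out entirely.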

Ways to downgrade

Take the earlier example of the restaurant questionnaire again. When the restaurant is crowded, cancelling the questionnaire outright is just one option. There are many others, such as:

1. Ask the customer for a mobile phone number first, then after checkout send a text message asking them to fill in an electronic questionnaire.

2. Place questionnaires at the store entrance so that customers can fill one in as they leave.

And so on; there are plenty of options if you think about it.

Similarly, large websites have many ways to degrade a service. The common ones are as follows:

Delay service

For example, when a user posts a comment, the important part, such as showing the comment in the article, works normally, but awarding points to the user is delayed: the task is simply put into a cache and executed once the service is stable again.
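
The delayed-service idea could look roughly like the sketch below. `CommentService`, its fields, and the one-point-per-comment rule are hypothetical, and an in-memory `deque` stands in for the cache.

```python
from collections import deque


class CommentService:
    """Hypothetical service: comments are core, awarding points is not."""

    def __init__(self):
        self.comments = []       # core: comments shown in the article
        self.points = {}         # non-core: points per user
        self.deferred = deque()  # in-memory stand-in for the cache
        self.degraded = False

    def post_comment(self, user, text):
        self.comments.append((user, text))  # always done immediately
        if self.degraded:
            self.deferred.append(user)      # delay the points update
        else:
            self._add_points(user)

    def _add_points(self, user):
        self.points[user] = self.points.get(user, 0) + 1

    def recover(self):
        """Called once the service is stable: replay the delayed work."""
        self.degraded = False
        while self.deferred:
            self._add_points(self.deferred.popleft())


service = CommentService()
service.degraded = True
service.post_comment("alice", "Nice article")
points_while_degraded = dict(service.points)
service.recover()
points_after_recovery = dict(service.points)
```

The comment is visible right away, while the points only appear after `recover()` has drained the deferred queue.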

Shutting down services at a fine granularity (fragment degradation or service function degradation)

For example, turn off related article recommendations by simply closing the recommendation area.

Page asynchronous request degradation

For example, the product details page loads recommendation and delivery information through asynchronous requests. If these responses are slow or the back-end service has problems, the requests can be degraded.

Page redirect (page degradation)

For example, related article recommendations can still be shown, but the link to more pages redirects directly to a fixed address.

Write degradation

For example, we can update only the cache and then deduct the inventory in the DB asynchronously, which guarantees eventual consistency. In this case, the DB write is downgraded to a cache write.
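
A minimal sketch of this write path, assuming plain dictionaries as stand-ins for the cache and the DB. `InventoryStore` and its methods are hypothetical names, and a real system would flush the pending writes from a background worker or message queue.

```python
class InventoryStore:
    """Hypothetical store with a cache and a DB, both modelled as dicts."""

    def __init__(self, stock):
        self.cache = dict(stock)
        self.db = dict(stock)
        self.pending = []  # deductions not yet applied to the DB

    def deduct(self, sku, qty, write_degraded):
        if self.cache.get(sku, 0) < qty:
            return False                     # not enough stock
        self.cache[sku] -= qty               # the cache is always updated
        if write_degraded:
            self.pending.append((sku, qty))  # DB write is postponed
        else:
            self.db[sku] -= qty              # normal path: write through
        return True

    def flush(self):
        """Apply postponed deductions so the DB catches up with the cache."""
        while self.pending:
            sku, qty = self.pending.pop(0)
            self.db[sku] -= qty


store = InventoryStore({"sku-1": 10})
accepted = store.deduct("sku-1", 3, write_degraded=True)
db_before_flush = store.db["sku-1"]
store.flush()
db_after_flush = store.db["sku-1"]
```

Between the deduction and the flush, the cache and the DB disagree, which is the eventual consistency trade-off the text describes.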

Read degradation

For example, in a multi-level cache setup, reads can be downgraded to cache-only reads if the back-end service has problems. This mode is suitable for scenarios that do not require strong read consistency.
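
One way to picture this read path is the sketch below. The function name, the two dictionary caches, and the `backend_healthy` flag are assumptions made for illustration.

```python
def read_product(key, local_cache, remote_cache, load_from_backend, backend_healthy):
    """Look through the cache levels first; skip the back end while degraded."""
    for cache in (local_cache, remote_cache):
        if key in cache:
            return cache[key]       # possibly stale, acceptable in this mode
    if not backend_healthy:
        return None                 # degraded: caches only, no back-end call
    value = load_from_backend(key)  # normal path: fall through to the back end
    local_cache[key] = value
    return value


backend_calls = []


def load_from_backend(key):
    backend_calls.append(key)
    return {"id": key, "title": "fresh"}


local, remote = {}, {"p1": {"id": "p1", "title": "cached"}}
hit = read_product("p1", local, remote, load_from_backend, backend_healthy=False)
miss = read_product("p2", local, remote, load_from_backend, backend_healthy=False)
fresh = read_product("p2", local, remote, load_from_backend, backend_healthy=True)
```

While the back end is marked unhealthy, a cache miss returns `None` instead of adding load to a service that is already struggling.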

Degradation intervention

Depending on whether degradation can be triggered automatically, there are two kinds of intervention: automatic switch degradation and manual switch degradation.

Automatic switch degradation

When the system meets certain conditions (such as thresholds on system load, resource usage, or SLA metrics), it automatically applies the corresponding degradation policies.

Common indicators that can be used as automatic downgrade conditions are as follows:

Service timeout

When a database, HTTP service or remote call responds slowly, or stays slow for a long time, and the service is not a core service, it can be automatically degraded after a timeout.

For example, the details page mentioned above has recommendation and favorites functions. Even if they have problems, users can still place orders normally. If you are calling someone else's remote service, agree on a maximum response time with the other party. If the response time is exceeded, the call can be automatically degraded.
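
A minimal sketch of timeout-based degradation, using the standard library thread pool. The helper `call_with_timeout`, the two sample functions, and the timeout values are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(remote_call, timeout_s, fallback):
    """Run remote_call in a worker thread; return fallback if it takes too long."""
    future = _pool.submit(remote_call)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() cannot stop a call that is already running;
        # it only prevents a still-queued call from starting.
        future.cancel()
        return fallback


def slow_recommendations():
    time.sleep(0.5)  # simulates a slow remote service
    return ["item-1"]


def fast_recommendations():
    return ["item-2"]


degraded_result = call_with_timeout(slow_recommendations, 0.05, fallback=[])
normal_result = call_with_timeout(fast_recommendations, 0.5, fallback=[])
```

The page gets an empty recommendation list after 50 ms instead of waiting for the slow call, so ordering is not held up by a non-core module.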

Number of failures

Aside from timeouts, the most common exception when calling an external service is a failed call. For example, if a request for the inventory information on the details page fails, it can be degraded directly by reading cached data instead.

One problem with this kind of degradation, however, is that even though one request falls back to the cache, requests from other users will still query the inventory system, which makes things worse for it. It may already be in trouble, yet the upstream systems keep sending it requests.

So you can apply a uniform degradation to the inventory query interface. Set a threshold for the number of failures. Once the total number of failures reaches this threshold, the query interface stays degraded for the following period, until the service has recovered.
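
The failure threshold described above can be sketched as a small circuit breaker. This is an illustration under assumptions: `FailureCountBreaker` is a hypothetical class, it counts consecutive failures rather than a running total, and it treats a fixed cooldown as the "subsequent period" before trying the service again.

```python
import time


class FailureCountBreaker:
    """Degrade an interface for cooldown_s seconds after repeated failures."""

    def __init__(self, threshold, cooldown_s, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock       # injectable so the sketch can be tested
        self.failures = 0
        self.opened_at = None    # time at which degradation started

    def call(self, query, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback()   # degraded: the service is not called
            self.opened_at = None   # cooldown over: try the service again
            self.failures = 0
        try:
            result = query()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # a success resets the counter
        return result


now = [0.0]
attempts = [0]


def failing_query():
    attempts[0] += 1
    raise RuntimeError("inventory service unavailable")


breaker = FailureCountBreaker(threshold=3, cooldown_s=10, clock=lambda: now[0])
results = [breaker.call(failing_query, lambda: "cached stock") for _ in range(5)]
now[0] = 11.0  # pretend the cooldown has passed
recovered = breaker.call(lambda: "live stock", lambda: "cached stock")
```

Of the five calls, only the first three reach the failing service; the last two are answered from the fallback without touching it.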

Outage

The failures mentioned above may be caused by service instability and can recover automatically after a period of time. Another possibility is that the dependent service is completely down, or the network is down, and so on. In this situation the call can be degraded directly.

When an HTTP request returns a specific error code, or the underlying service throws an exception during an RPC request, a failure is considered to have occurred and the call can be degraded directly.

Rate-limiting degradation

Another degradation strategy that is common on e-commerce sites is rate-limiting degradation. For some functions, set a traffic threshold and degrade once the traffic reaches it.

For example, if traffic becomes too heavy in an instant, rate-limiting degradation can kick in. Subsequent users are told directly that nothing is left, redirected to an error page, or asked to enter a verification code and try again.
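
A minimal sketch of rate-limiting degradation with a fixed-window counter. `WindowLimiter`, `handle_purchase`, the limits, and the messages are hypothetical, and production limiters typically use more refined algorithms such as sliding windows or token buckets.

```python
import time


class WindowLimiter:
    """Allow at most max_requests per window_s seconds (fixed window)."""

    def __init__(self, max_requests, window_s, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock  # injectable so the sketch can be tested
        self.window_start = clock()
        self.count = 0

    def allow(self):
        now = self.clock()
        if now - self.window_start >= self.window_s:
            self.window_start = now  # start a new window
            self.count = 0
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False


def handle_purchase(limiter):
    if limiter.allow():
        return "order accepted"
    # degraded response once the traffic threshold is reached
    return "too many requests, please try again later"


now = [0.0]
limiter = WindowLimiter(max_requests=2, window_s=1.0, clock=lambda: now[0])
first_window = [handle_purchase(limiter) for _ in range(3)]
now[0] = 1.5  # the next window begins
next_window = handle_purchase(limiter)
```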

Manual switch degradation

The other kind of degradation is manual switch degradation.

Manual switch degradation refers to a method by which system maintenance personnel manually modify parameters or disable services after discovering system exceptions.

The advantage of this approach is that it is flexible and can be adapted to the specific abnormal situation. The disadvantage is that it places relatively high demands on people. On the one hand, maintenance personnel need to understand the system well enough; on the other hand, they are required to respond to system anomalies immediately.

Manual intervention is also possible ahead of time: before a big promotion, developers anticipate heavy traffic, identify the risks, and manually downgrade certain features to save resources and keep the main flow available.

The manual switch degradation mentioned here does not have to be operated by hand; it can also be triggered by a scheduled task.
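
A switch of this kind can be sketched as a small in-process registry. `DegradeSwitches`, `apply_schedule`, the feature name, and the dates are hypothetical; in practice such switches usually live in a configuration center so that operators can flip them without a redeploy.

```python
import threading
from datetime import datetime


class DegradeSwitches:
    """Thread-safe set of features that are currently switched off."""

    def __init__(self):
        self._lock = threading.Lock()
        self._off = set()

    def disable(self, feature):
        with self._lock:
            self._off.add(feature)

    def enable(self, feature):
        with self._lock:
            self._off.discard(feature)

    def is_enabled(self, feature):
        with self._lock:
            return feature not in self._off


def apply_schedule(switches, feature, start, end, now):
    """Meant to be run by a scheduled task: switch off within [start, end)."""
    if start <= now < end:
        switches.disable(feature)
    else:
        switches.enable(feature)


switches = DegradeSwitches()
start = datetime(2024, 11, 11, 0, 0)  # illustrative promotion window
end = datetime(2024, 11, 11, 2, 0)

apply_schedule(switches, "recommendation", start, end, now=datetime(2024, 11, 11, 0, 30))
during_promotion = switches.is_enabled("recommendation")
apply_schedule(switches, "recommendation", start, end, now=datetime(2024, 11, 11, 3, 0))
after_promotion = switches.is_enabled("recommendation")
```

An operator would call `disable` and `enable` directly, while a scheduled task would call `apply_schedule` with the current time.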

Degradation tools

Currently there are two widely used options for traffic control: Netflix Hystrix and Alibaba Sentinel.

Hystrix



Hystrix is a library that provides latency tolerance and fault tolerance between services, in order to control cascading failures in distributed systems. It improves the resiliency of a distributed system by isolating the access points to services, stopping cascading failures, and providing fallback options for failures.

Hystrix focuses on fault tolerance based on isolation and circuit breaking: calls that time out or are circuit-broken fail fast, and a fallback mechanism is provided.

Sentinel



Sentinel is a lightweight, highly available flow control component for distributed service architectures, open-sourced by the Alibaba middleware team. It takes traffic as its entry point and helps users protect the stability of their services along multiple dimensions, such as flow control, circuit breaking and degradation, and system load protection.

Sentinel focuses on diversified flow control, circuit breaking and degradation, system load protection, real-time monitoring and a console.

Comparison

This is the comparison of Sentinel and Hystrix from the Sentinel documentation.