Hardware failures, black swan events, traffic spikes, resource failures, DDoS attacks, and code bugs can all undermine the stability of a system. With so many possible causes of instability, how do we deal with them?

Generally speaking, the work falls into four areas: preventing failures in advance, discovering failures promptly, troubleshooting them efficiently, and recovering from them quickly.

Before we talk about preventing failures in advance, let me share a little story.

Bian Que, a famous physician of the Spring and Autumn Period, was the youngest of three brothers, all of them doctors. Once, King Wen of the State of Wei asked him, “Which of you three brothers has the finest medical skill?” Bian Que replied, “My eldest brother’s skill is the best, my second brother’s is second, and mine is the worst.” King Wen asked, “But aren’t you the best? Your fame has long spread far and wide.” Bian Que explained, “My eldest brother can detect and prevent illness before people even fall sick. My second brother treats minor ailments with medicine before they can get worse. So people think they are only good at handling trivial complaints, when in fact they are not.

“My patients are usually already gravely ill, and I have to cut them open and operate on them, which is why people think I am the best. But the healer’s highest skill is to prevent disease.”

This shows how important it is to prevent failures in advance, and for a technical team this is the harder test. So what should we do?

The design phase

The first thing is to keep the system design as simple as possible. A team rarely has the energy to maintain an overly complex distributed architecture, and in the end problems will only occur more frequently. From the perspective of stability, a simple, usable architecture that fits the product's current stage is the best one.

Second, we need to understand the overall picture of the system and its key metrics as thoroughly as possible. Only by knowing the system as a whole can the people who build and operate it respond well when problems occur.

The third is to choose appropriate components for the system. The availability and performance of a component, the maturity of its community, and the cost of maintaining it are all factors to weigh. From a project perspective, early-stage projects tend to adopt more aggressive technologies, but by the middle and late stages, more mature technologies and components are preferred in order to keep the system stable and make subsequent maintenance simpler.

In addition, design reviews should be standardized. For larger designs in particular, the storage architecture, stored data, cache, and database design should all be documented, and foreseeable problems raised in advance, to avoid rework when issues surface later in development.

The deployment phase

At this stage, change is the main cause of system failures. The Google team mentioned in a previous post that about 70% of failures are caused by change; in practice the figure may well be above 90%.

To limit the impact of changes, the first thing is to standardize the release process, including release times, frequency, and cycles. For example, many teams allow releases only from Monday through Thursday, at most once a day; releases are forbidden on Fridays and before holidays. Release windows are also specified, for instance no releases during the evening peak after 5:30. After each release we monitor and observe, so that any problems introduced by the release are discovered as soon as possible.
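As an illustration, such a release-window policy can be encoded as a simple gate in the deployment tooling. The sketch below is minimal, and the weekday, cutoff, and holiday rules are hypothetical placeholders, not the policy of any particular team:

```python
from datetime import datetime, time

# Hypothetical release-window policy (illustrative values only):
# releases allowed Monday-Thursday, before the 17:30 evening peak,
# and never on dates just before a holiday.
ALLOWED_WEEKDAYS = {0, 1, 2, 3}        # Mon=0 ... Thu=3
CUTOFF = time(17, 30)                  # no releases after 17:30
PRE_HOLIDAY_DATES = {"2023-09-28"}     # placeholder pre-holiday dates

def release_allowed(now: datetime) -> bool:
    """Return True if a release may start at `now` under this policy.
    (The once-per-day rule would need release history; omitted for brevity.)"""
    if now.strftime("%Y-%m-%d") in PRE_HOLIDAY_DATES:
        return False
    if now.weekday() not in ALLOWED_WEEKDAYS:
        return False
    return now.time() < CUTOFF

print(release_allowed(datetime.now()))
```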

Gray release (canary), warm-up, and slow start are all important ways to avoid problems caused by changes; we won't expand on them here.
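To make the gray-release idea concrete, here is a minimal sketch of weighted routing between a stable version and a canary. The ramp schedule is an assumption for illustration, not a recommendation:

```python
import random

# Gray release sketch: route a small, growing fraction of requests
# to the new version as rollout stages advance.
RAMP = [0.01, 0.05, 0.20, 0.50, 1.00]   # canary weight per stage (illustrative)

def pick_version(stage: int) -> str:
    """Route one request: canary with probability RAMP[stage], else stable."""
    weight = RAMP[min(stage, len(RAMP) - 1)]
    return "canary" if random.random() < weight else "stable"

# At stage 1, roughly 5% of requests should hit the new version.
sample = [pick_version(1) for _ in range(10_000)]
print(sample.count("canary") / len(sample))
```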

In terms of capacity assessment, there are generally a few things to do. The first is to determine the sources of traffic, for example whether the traffic during a big promotion comes from H5 pages or from third-party interfaces. Once the sources are determined, a target value can be set based on past experience.

The purpose of mapping out the call links and services is to conduct a better capacity assessment in combination with monitoring and full-link stress testing. Setting a water level helps judge whether the previously set target is reasonable, and whether we should optimize or scale out, so as to finally determine the number of machines and resources.
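A back-of-the-envelope version of this calculation might look like the following. Every number here is hypothetical; in practice the per-machine capacity would come from stress testing and the target from traffic analysis:

```python
import math

# Illustrative capacity calculation (all numbers are hypothetical).
target_qps = 50_000        # peak traffic target set from past promotions
per_machine_qps = 800      # single-machine capacity measured via stress testing
water_level = 0.6          # keep each machine at or below 60% of capacity

machines = math.ceil(target_qps / (per_machine_qps * water_level))
print(machines)  # 105 machines needed to hold the target at this water level
```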

If you are interested in the full-link stress testing mentioned above, you can review the previous article for more details. 👉 20 Questions on Full-Link Stress Testing: A Summary (Part 1)

With design and deployment handled well, we should then think about how to reduce the probability of failure through architectural means. There are two main ones: redundancy and isolation.

Redundancy includes computational redundancy and storage redundancy. For storage redundancy, replication is the most common approach, including synchronous, semi-synchronous, and asynchronous replication. Storage redundancy is inherently more complex, because data must be copied from one place to another while maintaining its integrity. Computational redundancy, on the other hand, commonly takes the form of load-balanced servers and multi-datacenter strategies.
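The practical difference between the three replication modes comes down to when the client's write is acknowledged. A toy sketch, with replicas simulated as plain lists rather than real nodes:

```python
# Toy sketch of the three replication modes (illustrative, not a real engine).
def write(primary, replicas, value, mode):
    primary.append(value)
    if mode == "sync":
        # Acknowledge only after every replica has the write.
        for r in replicas:
            r.append(value)
    elif mode == "semi-sync":
        # Acknowledge once at least one replica has the write;
        # the rest catch up asynchronously in the background.
        replicas[0].append(value)
    elif mode == "async":
        # Acknowledge immediately; replicas catch up later,
        # so a primary crash at this point can lose the write.
        pass
    return "ack"

primary, replicas = [], [[], []]
write(primary, replicas, "x=1", "semi-sync")
print(primary, replicas)  # ['x=1'] [['x=1'], []]
```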

Isolation refers to service isolation. A service can be split into multiple sub-modules by function or importance, so that the modules do not affect one another. The idea comes from ship design: a large ship has many cabins, kept isolated from one another, so that even if one cabin springs a leak, only that cabin is affected and the whole ship is not dragged down. Service isolation follows the same idea: you can choose to give up some functions, or the stability of one service, to guarantee global stability.
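One common way to implement this bulkhead idea is to give each sub-module its own bounded resource pool. A minimal sketch, assuming hypothetical module names and pool sizes:

```python
from concurrent.futures import ThreadPoolExecutor

# Bulkhead sketch: each sub-module gets its own bounded thread pool, so a
# slow or failing module exhausts only its own "compartment", not the ship.
POOLS = {
    "search":  ThreadPoolExecutor(max_workers=20),
    "comment": ThreadPoolExecutor(max_workers=5),  # less critical, smaller pool
}

def submit(module: str, fn, *args):
    """Run `fn` inside the module's own compartment."""
    return POOLS[module].submit(fn, *args)

# Even if every comment call hangs, search still has its full 20 workers.
future = submit("search", lambda q: f"results for {q}", "stability")
print(future.result())
```

Sizing the compartments is where the trade-off lives: giving a less critical module a smaller pool is exactly the "give up some functions to protect global stability" choice described above.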

So much for preventing failures. How do we deal with them when they happen? Are your monitoring metrics adequate? How do you set targets?

Is your alerting logic correct? What alerting principles should be followed?

How can you repair problems quickly? What roles should circuit breaking, degradation, and flow control play?

For more detailed answers to these questions, follow Sequence Technology and reply with the keyword [71] to view the original video of the talk.



Contributed by: Tang Yang, Senior Technical Expert at Meitu

Compiled by: Thirty