Hi, I’m Bella Jam

“It’s not as if it can’t be used.” “It’s not as if it can’t run. As long as either the program or the person can run, that’s enough.”

I believe many of you have heard these two lines. They sound reasonable enough. Programming has three stages: “Make It Run, Make It Fast, Make It Better”. Both lines stop at the first stage, “Make It Run”.

What is stability? I place it in the “Make It Better” stage. Stability means finding problems in advance through a series of measures and nipping them in the bud. It means noticing and handling a problem before the business side does, so that its impact is minimized. It means holding a review after a problem has occurred, to see what can be improved so that the same thing does not happen again. In short, stability means keeping the system running normally, keeping the business as smooth as silk, and keeping the data accurate.

Stability is not just a collection of piecemeal tasks; it has rules to follow and a systematic method. This article explains how a business team can handle stability in its daily work, covering mechanism control, monitoring and alarms, sorting out system risk points, life-saving measures, the emergency mechanism for online problems, fault drills, post-event review, and publicity, which together span the periods before, during, and after an incident.

1. Mechanism control

Anything that can be handled by a system should be handled by a system; relying on people is extremely risky. I believe many of you understand this. The same holds for stability assurance, which uses a series of mechanisms and rules to regulate behavior in daily work.

Here are some things to consider:

1) Before a branch is released, it must merge in the code from the origin master branch.

2) A branch must go through code review (CR) and be passed by a tester before release. The developer and the tester cannot be the same person, unless the tester has agreed that the developer may self-test. The publisher and the code reviewer cannot be the same person.

3) Every branch release must be recorded in a system, including the release time, the branch, the content, the scope of impact, whether testing passed, and so on.

4) Set up a release window. A release outside the window requires approval. The approver can be the boss, the person in charge of stability in the team, or someone else depending on the team's situation, such as a core developer. The main purpose of the release window is to prevent releases at night or on weekends, when a problem caused by the release might not be noticed and handled in time.

5) Define the release process, mainly in terms of server environments. For example, the release system can require that a branch be deployed successfully in the daily environment before it can be deployed to pre-release, and deployed successfully in pre-release before it can go online. When going online, the branch must first be deployed successfully in the Beta environment and observed for a specified time before the release continues. The process should also specify the minimum number of batches for an online release, the minimum observation time for each batch, and so on.

6) Never switch traffic all at once; there must be a gradual grayscale process. Whether grayscale is done by the proportion of machines released or by a business-level grayscale flag or rule, traffic must be shifted gradually, never in a single switch. Gradual grayscale is a process of verifying functionality and exposing problems. The grayscale scope is controllable and can be communicated to the business side, so that when a problem occurs they know where it lies instead of scratching their heads.

7) Data repair must be carried out with tools. A tool should record the current operator and the operation time, should be able to check that the repair is correct, and should log every use so that the operation history can be reviewed later.
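Several of the rules above can be checked mechanically before a release is allowed to start. The following is only a minimal sketch of such a gate, not any particular release system: the ReleaseRequest fields, the weekday 10:00 to 18:00 window, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List


@dataclass
class ReleaseRequest:
    # All field names are hypothetical; adapt them to your own release system.
    developer: str
    tester: str
    reviewer: str
    publisher: str
    merged_master: bool                      # rule 1: origin master merged in
    test_passed: bool                        # rule 2: tester passed the branch
    self_test_approved: bool = False         # rule 2: tester agreed to developer self-testing
    window_exception_approved: bool = False  # rule 4: approver allowed an out-of-window release


# Assumed release window: Monday to Friday, 10:00 to 18:00.
WINDOW_START = time(10, 0)
WINDOW_END = time(18, 0)


def in_release_window(when: datetime) -> bool:
    """True if the given moment falls inside the assumed release window."""
    return when.weekday() < 5 and WINDOW_START <= when.time() < WINDOW_END


def release_violations(req: ReleaseRequest, when: datetime) -> List[str]:
    """Return the list of broken rules; an empty list means the release may proceed."""
    problems = []
    if not req.merged_master:
        problems.append("branch has not merged origin master")
    if not req.test_passed:
        problems.append("tester has not passed the branch")
    if req.developer == req.tester and not req.self_test_approved:
        problems.append("developer and tester are the same person")
    if req.publisher == req.reviewer:
        problems.append("publisher and code reviewer are the same person")
    if not in_release_window(when) and not req.window_exception_approved:
        problems.append("outside the release window without approval")
    return problems
```

Returning every violation at once, rather than stopping at the first, lets the release record show everything that has to be fixed before the next attempt.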

2. Monitoring and alarms

The point of monitoring and alarms is to find a problem before the business side does, which greatly improves response speed and lets you start locating the problem sooner. Once the problem has been located, its scope of impact can be assessed. At that point, consider how to stop the bleeding quickly and keep the impact as small as possible.

The following points should be noted when configuring monitoring alarms:

1) Alarm range. Alerting only one person is too risky; that person might miss the alarm while taking a shower, for example. You can set up an alarm chat group (on DingTalk, for example) that includes the colleagues involved in the business, the colleagues responsible for stability in the team, the boss, and so on. This avoids a human “single point of failure that brings the service down”: even if one person misses the alarm, others are there, and it is enough for anyone to see it and act on it.

2) Alarm priority. Different situations need different alarm priorities. As a simple example using the service success rate: if the success rate stays below 97% for 3 minutes, send the alarm to the alarm chat group; if it stays below 97% for 5 minutes, send an SMS to the subscribers; if it stays below 97% for 10 minutes, phone the subscribers directly.

3) Alarm content. This depends on the team's services. Generally speaking, alarms at the DB level, on the service success rate, on the number of failed service requests, and at the machine level are all necessary. DB alarms cover CPU usage, connection count, QPS, TPS, RT, disk usage, and so on. Machine-level alarms cover CPU utilization, load, disk utilization, and so on. For service success rate alarms, the duration used in the alarm condition needs careful thought.

4) Verifying alarm effectiveness. This is very important: if the alarms are not effective, the work on range, priority, and content above is wasted. If the alarm threshold is set too high, the alarm never fires and has no effect. If it is set too low, you receive a flood of alarms every day, grow numb to them over time, and, as in the story of the boy who cried wolf, may not believe the alarm on the day the wolf really comes. The most reliable thresholds are set from baselines for the service, DB, machines, and so on, measured through load testing. If no load testing was done early on, observe the normal daily load first, set the threshold a little lower, and then adjust it gradually according to the alarms you actually receive.
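The priority example in point 2 (3, 5, and 10 minutes below 97%) can be sketched as a small escalation function. This is an illustration only: it assumes one success-rate sample per minute, and the channel names and function names are made up for the example rather than taken from any monitoring product.

```python
from typing import List, Optional

THRESHOLD = 0.97  # success rate below this value counts as unhealthy

# (minutes below threshold, channel), most severe first.
# The channel names are placeholders for this sketch.
ESCALATION = [(10, "phone"), (5, "sms"), (3, "group_chat")]


def minutes_below(success_rates: List[float], threshold: float = THRESHOLD) -> int:
    """Count how many of the most recent per-minute samples are consecutively below the threshold."""
    count = 0
    for rate in reversed(success_rates):
        if rate >= threshold:
            break
        count += 1
    return count


def alarm_channel(success_rates: List[float]) -> Optional[str]:
    """Pick the most severe channel whose duration condition is met, or None if no alarm is due."""
    duration = minutes_below(success_rates)
    for minutes, channel in ESCALATION:
        if duration >= minutes:
            return channel
    return None
```

Counting only the consecutive samples at the end of the series means a single recovered minute resets the escalation, which is one simple way to keep short dips from paging anyone.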

3. Sorting out system risk points

The ancients said: know yourself and know your enemy, and you can fight a hundred battles without defeat. Why know the enemy? Only by knowing him do you know his weaknesses. The same is true for stability: only when you know the system do you know its risk points. Of course, fully understanding a system takes a lot of time and effort. The ancients also said that the gentleman is not different from others by nature; he is simply good at making use of things. In the early stage of sorting out risk points, you can draw on the help of your lovely colleagues: find the owner of each application and learn its current state, its potential risk points, and whether countermeasures exist.

When sorting out system risk points, pay attention to the following:

1) Whether the link forms a closed loop. This is critical, because a system-level defect has a huge impact. When checking a link for a closed loop, step outside the developer's perspective and look at it as a product: can it run normally under every circumstance? If you find that the loop is not closed, tell the product manager and your boss immediately. Then think from the product perspective about how to fix the problem and which features need to be built, and schedule the fix as a high-priority development task.

2) Slow SQL. Slow SQL is very destructive. A single slow SQL statement can bring down an entire database. While it is running, the execution time of every other SQL request grows considerably as well. At that point you need to stay calm and track down the real slow SQL by the time at which the trouble started, or ask the DBAs for help; leaving professional work to professionals gives more reliable results in less time. Even if you find the slow SQL yourself, it is best to confirm your findings with a DBA so that the real culprit is not missed.

3) Whether core and non-core applications affect each other. Businesses of different importance should be separated physically. Read/write separation follows the same idea: read requests and write requests go to separate groups of machines so that they do not interfere with each other. The same separation can also be applied at the DB level.

4. Life-saving measures

Mechanism control, monitoring alarms, and sorting out system risk points all aim to prevent problems from occurring. But as the saying goes, if you often walk by the river, your shoes will get wet sooner or later. If a problem does occur online, how do you stop the bleeding quickly? For that, you need life-saving measures prepared in advance as part of daily work. If you wait until a problem occurs, it may be too late to prepare them.

So what are the life-saving measures?

1) Rate limiting. Rate limiting causes some requests to fail, but it reduces the pressure on the service and on the DB, and in some situations it can save the system. A common rate limiting tool is Sentinel. In daily work, the limit values can be configured in the Sentinel console, and when a problem occurs the rate limiting rule can be pushed directly. You can limit traffic for a whole cluster or for a single machine, depending on the business scenario and requirements. You can limit traffic from all calling applications, or configure limits for specific applications, depending on your requirements.

2) Degradation. Degradation can be classified as lossy or lossless. Lossy degradation is noticeable to the business, while lossless degradation is purely technical and goes unnoticed by the business, such as switching queries from one data source to a standby data source. A typical example of lossy degradation is the red envelope event during the Spring Festival Gala: Baidu announced directly that login and registration for Baidu cloud disk would be degraded on New Year’s Eve, to make sure the red envelope event ran smoothly. For lossy degradation, define in daily work under what circumstances and at what metric values it may be carried out. This prevents someone from degrading hastily, without enough thought, when a problem occurs and causing a more serious problem.

3) Traffic switching. When network problems, hardware problems, or other problems make a data center unavailable, traffic can be switched to other data centers. When a data source is unavailable, traffic can likewise be switched to the standby data source. These measures either reduce the scope of the problem or resolve it.

4) Distribution control. If the problem is in a service that external users deal with directly, notify them through the business side or by other means: tell them that the system currently has a problem and that the relevant colleagues are working on it. This reassures them and keeps them from being left confused and at a loss. Some official accounts inform users through microblog posts, which is also a form of distribution control.

5) Contact information for upstream and downstream teams, DBAs, and so on. Sometimes the problem is not in your own system: an abnormality in an upstream or downstream system indirectly lowers your success rate. In that case it is very important to have the contact information of your upstream and downstream partners at hand, so that you can inform them quickly and ask them to join the investigation. Sometimes the problem is in the DB and needs a DBA to solve it, so the DBAs' contact information is just as important. One small command from a DBA may rescue you from dire straits.
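To make the rate limiting idea in point 1 concrete, here is a minimal token bucket limiter. It is a generic sketch of one common rate limiting technique and is not Sentinel's API or implementation; the class name, the parameters, and the injectable clock are assumptions made for the example.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second on average, with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum number of stored tokens
        self.clock = clock            # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may pass, False if it should be rejected."""
        now = self.clock()
        # Refill according to the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check bucket.allow() before handling each request and return a fast failure when it is False, which is the "some requests fail, but the service and DB survive" trade-off described above. This sketch is not thread-safe; a real service would need a lock or a per-thread limiter.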

Life-saving measures such as rate limiting, degradation, traffic switching, and distribution control all need a clear definition of which system metrics must reach what values before the measure may be applied. That way decisions can be made quickly when a problem occurs, and improper use of a measure does not create a new problem.
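One way to write such definitions down is as a small table that maps each measure to a metric and a trigger value. The sketch below is purely illustrative: the metric names, the thresholds, and the measure names are invented for the example and must be replaced with values agreed on by your own team.

```python
from typing import Dict, List

# Hypothetical playbook: (measure, metric, trigger when the metric is at or above this value).
PLAYBOOK = [
    ("rate_limit", "qps", 5000),
    ("lossy_degrade", "db_cpu_percent", 90),
    ("switch_traffic", "datacenter_error_rate", 0.5),
]


def measures_to_apply(metrics: Dict[str, float]) -> List[str]:
    """Return the measures whose trigger condition is met by the current metrics."""
    return [
        measure
        for measure, metric, threshold in PLAYBOOK
        if metrics.get(metric, 0) >= threshold  # a missing metric never triggers a measure
    ]
```

The function only suggests measures; the decision to apply one still belongs to a person, but agreeing on the table in advance removes the debate about thresholds in the middle of an incident.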

5. Emergency mechanism for online problems

What should you do when a problem really does occur online? Don’t panic. The more you panic, the more chaotic things get.

First, respond quickly, to show that someone is already working on locating the problem. Once the problem is located, stopping the bleeding quickly comes first. At this point you can use the life-saving measures described above. What if they do not work? There are other options, and releasing a fix is one of them.

One person cannot handle all of this well. There needs to be a communicator who handles external communication and reports current progress promptly. There needs to be a handler who locates the problem, assesses the scope of impact, and makes recommendations. There needs to be a decision maker, whether the team's stability lead, the boss, or the business side, who decides how to stop the bleeding quickly. Once a decision is agreed, the handler carries it out and the communicator keeps the outside informed.

6. Fault drill

The purpose of a fault drill is to observe how people really react and how they handle the problem when a fault occurs. Therefore, do not tell the team in advance that a drill is planned, and do not reveal in advance which online problem will be simulated; otherwise the drill is meaningless. Of course, a fault drill must also be kept within limits so that it does not affect the online business.

7. Post-event review

After any online problem, big or small, there should be a corresponding review. A small problem can be reviewed within a small group, while a large problem calls for a wider circle of participants.

What does a review mainly cover?

1) When was the problem detected?

To pin down the point in time.

2) How was the problem found? Through monitoring alarms or business feedback?

To determine whether the monitoring alarms are effective and whether there is room for optimization.

3) When was the problem responded to?

To determine whether the response was timely and whether there is room for optimization.

4) When was the problem located?

To assess familiarity with the system and the ability to handle online problems. Some team members may not know the system well enough, so locating the problem takes a long time. Others may become flustered when something goes wrong or under high pressure, so they take longer than usual. The two cases call for improving different abilities.

5) Were there any measures to stop the bleeding quickly?

For example, rate limiting, degradation, or traffic switching. This shows whether the day-to-day safeguards are really in place. If anything is missing, record it as an action item and fill the gap afterwards. Better still, review the whole system at the same time for any other missing measures and add them together.

6) Was collaboration smooth when upstream, downstream, or DBA help was needed?

To assess whether there is room for improvement in collaboration and communication. After all, when upstream, downstream, or DBA help is needed, being able to reach the right people quickly saves a lot of time and gets the problem solved sooner.
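The time-related questions in this list become easier to compare across incidents if the key timestamps are recorded and the intervals computed the same way every time. The sketch below assumes a timeline with four hypothetical keys (detected, responded, located, mitigated); the key names and the function name are illustrative, not part of any incident tool.

```python
from datetime import datetime
from typing import Dict


def review_durations(timeline: Dict[str, datetime]) -> Dict[str, float]:
    """Return the minutes elapsed between consecutive key points of an incident."""

    def minutes(start: str, end: str) -> float:
        return (timeline[end] - timeline[start]).total_seconds() / 60

    return {
        "detected_to_responded": minutes("detected", "responded"),  # was the response timely?
        "responded_to_located": minutes("responded", "located"),    # how long did locating take?
        "located_to_mitigated": minutes("located", "mitigated"),    # how fast was the bleeding stopped?
    }
```

For example, a timeline with detection at 10:00, response at 10:02, location at 10:17, and mitigation at 10:20 yields intervals of 2, 15, and 3 minutes, which points the review toward the locating step.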

8. Publicity

Stability is never one person's business. It concerns every team member, and it is something every team member should keep in mind.

When taking on a requirement, consider whether the requirement is reasonable and whether the link forms a closed loop.

When writing code, consider whether the code is robust, whether the syntax is used correctly, and whether you are writing slow SQL.

At release time, consider whether testing has passed, whether you are inside the release window, whether there is a grayscale policy, how to handle problems found during the release, whether the release can be rolled back, and so on.

How to respond quickly when a problem occurs online is also something everyone should know.

Some briefing sessions are needed to help every team member better understand what they should do to keep the system stable.

Well, I am Bella Jang, a girl who writes code at BAT. You are welcome to follow my personal WeChat official account (Bella’s technical wheel) so that we can learn and grow together! That’s it for today, see you next time.