Preface

Every programmer has been through an online failure, and troubleshooting and recovery are where we improve the fastest. Step into enough pits and you gradually become an expert. This is also one of the most popular questions asked by interviewers at big tech companies, because the answer gives the interviewer at least two signals. The first is whether the project you are responsible for is a core one: if you say you maintain a back-office admin system where a restart fixes any problem, you can show yourself out. The second is whether you approach problems systematically: how quickly you stopped the bleeding, how you narrowed down the specific cause step by step, and whether you prepared adequately before the failure and had an emergency plan for the known risk points.

Let's walk through the process of troubleshooting online faults.

Rapid hemostasis

In a distributed system, stopping the bleeding quickly is of paramount importance. Anyone who has worked at an Internet company knows that the first question in the dreaded post-mortem meeting is why the failure lasted half an hour and you only found out after the business side complained, or why it took 15 minutes just to figure out that a slow SQL query was the culprit while the service was being ground into the floor.

The reason is that in a distributed system, failures easily trigger a domino effect. For example, a slow SQL response in an infrastructure service causes upstream requests to pile up, threads cannot be released, and soon the whole online system slows to a crawl and throws errors everywhere. The avalanche builds fast, and when the heads of testing, operations, and the upstream systems are all bombarding you with calls and messages, it is hard to keep a clear head.

So what do you do? If the root cause still has not been located after a while, play the trump cards first:

  • Restart. When the system is unavailable, restart it first to restore availability, and keep locating the specific functional problem afterwards; a few minutes lost to a restart is far less embarrassing than a flood of errors.

  • Rate limiting. Rate-limit configuration for the major interfaces must be prepared in advance so that you can dynamically adjust the QPS allowed through each interface.

  • Rollback. If there was a release the day before, nine times out of ten the release is the cause. If the problem cannot be pinned down quickly, roll back first, then gather people to dig through the newly committed code.

  • Emergency scaling. This only works if the service is stateless and supports dynamic scale-out, and the bottleneck really is in the application layer; if the bottleneck is in the DB or elsewhere, adding application instances will not help (a rough sketch of rollback and scaling follows this list).
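For teams on a container platform, the rollback and scaling cards can be played in a command or two. A rough sketch, assuming (my assumption, not stated in the article) the service runs on Kubernetes and the deployment is named payment-service:

# roll back the deployment to the previous revision (hypothetical deployment name)
kubectl rollout undo deployment/payment-service
# emergency scale-out: only useful if the service is stateless and the bottleneck is the app layer
kubectl scale deployment/payment-service --replicas=20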

Troubleshooting Process

As mentioned above, if there was a release the day before and the failure follows it, roll back; most likely the regression testing was not thorough and the new code broke existing logic, and in the worst case a group of people has to comb through the code line by line. Now let's talk about what to do when the production service is slow and error alarms keep piling up.

  • Service monitoring. Monitoring is one of the most important means of keeping a system highly available; an online system must never run naked, otherwise you will not even know how it died. I recommend Meituan's open-source CAT monitoring system, which tracks every metric and every link event in real time, including server CPU load, JVM memory, GC information, thread information, slow URLs, slow SQL, average request response time, p95 and p99 latency, and the number of service error alarms per unit of time.

  • Troubleshooting commands. Interviewers also like to ask what commands you use and how you troubleshoot an online fault. The usual approach is to look at the system as a whole first, then drill into the parts.

1. Look at the system as a whole. First, run the top command to check the overall situation. The most important indicators are the load average, overall CPU usage, and the CPU and MEM of each process. You can also check uptime for the lite version (just the load averages).
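A minimal sketch of that first pass, captured in batch mode so the snapshot can be pasted into a ticket (standard procps top and uptime):

top -b -n 1 | head -20   # one snapshot: load average, overall CPU, top processes by CPU/MEM
uptime                   # the lite version: load averages over the last 1, 5 and 15 minutes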

2. CPU. Use the vmstat tool to check the CPU status. vmstat takes two parameters: the sampling interval (seconds) and the number of samples. For example:

vmstat -n 2 3 

This samples every 2 seconds, three times in total. The key fields are listed below; a small sketch of how to watch them follows the list.

  1. procs
  • r: number of processes running or waiting for a CPU time slice
  • b: number of processes blocked waiting for resources
  2. cpu
  • us: percentage of CPU time spent in user processes. If it stays above 50% for a long time, the application is burning too much CPU and needs to be optimized
  • sy: percentage of CPU time spent in the kernel
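A tiny sketch of watching those fields, using the rough rule of thumb (my assumption, not the article's) that a run queue more than twice the CPU count means CPU pressure:

# print a warning whenever the r column exceeds 2x the number of CPUs
vmstat -n 2 3 | awk -v cpus="$(nproc)" 'NR>2 && $1 > 2*cpus {print "run queue high:", $1}'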

3. Memory

free
free -g
free -m

As a rule of thumb, if free memory divided by total physical memory is below 20%, memory is tight and needs to be increased.
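A quick way to compute that ratio, assuming a recent procps free that prints an "available" column (older versions lay the columns out differently):

# available memory as a percentage of total physical memory
free -m | awk '/^Mem:/ {printf "available: %.0f%% of total\n", $7/$2*100}'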

4. Disk space

df -h
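If df shows a partition filling up, a common follow-up (not from the original article, just a habit worth having) is to find which top-level directory is eating the space:

# size of each top-level directory on the root filesystem, largest last
du -xh --max-depth=1 / 2>/dev/null | sort -h | tail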

5. Disk IO

iostat -xdk 2 3

In most scenarios a slow system is caused either by CPU or by disk IO. iostat is the usual tool here; its interval and count parameters mean the same as in vmstat.
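If iostat shows the disks are busy (high %util or long await), a hedged next step, assuming the sysstat package is installed, is to ask which process is generating the IO:

# per-process read/write rates, sampled every 2 seconds, 3 times
pidstat -d 2 3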

6. Network IO

ifstat 1

ifstat may not be present by default and may need to be installed first. It shows the in/out traffic of each NIC, so you can observe the network load and check whether the network is behaving normally.
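A sketch of installing and running it, assuming a Debian/Ubuntu-style distribution (adjust the package manager for yours):

sudo apt-get install ifstat   # package name on Debian/Ubuntu; other distros differ
ifstat 1 5                    # per-NIC in/out KB/s, sampled every second, 5 times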

When CPU usage is high

  • Run the top command to find the process with the highest CPU usage
  • Use ps -ef, or jps for Java processes, to further identify which program the CPU-hungry process belongs to
  • Locate the specific thread
ps -mp <pid> -o THREAD,tid,time

-m displays all threads of the process, -p specifies the process ID, and -o specifies a custom output format

  • Convert the located thread ID to hexadecimal
printf "%x\n" <tid>
  • Locate the specific code
jstack <pid> | grep <hex tid>

You can practice the above process locally: write a program with an endless loop, then follow the steps one by one, and you will certainly learn something.
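A minimal end-to-end sketch of those steps for a Java service; the PID 12345 and thread id 6789 below are placeholders, not real values:

top                              # step 1: note the PID of the process eating CPU
jps -l                           # confirm which Java program that PID belongs to
ps -mp 12345 -o THREAD,tid,time  # step 2: list the threads of PID 12345 and pick the one with the most CPU time
printf "%x\n" 6789               # step 3: convert the busy thread id to hex, here 1a85
jstack 12345 | grep -A 20 1a85   # step 4: find nid=0x1a85 in the thread dump to see exactly what code it is running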

Risk assessment and emergency plans before going online

In order to handle faults in an orderly manner, you must make thorough preparations in advance.

  • System monitoring. To repeat: an online system must never run naked

  • Define fault severity levels. Alibaba, for example, grades a fault by the number of users affected. Each severity level has its own deadline for stopping the bleeding; if the bleeding is not stopped within that time, the fault is escalated to a higher level and triggers a more serious process

  • Fault drills. Load testing the core interfaces is essential. For a flash-sale system, for example: how to handle an instantaneous traffic spike, what to do if Redis goes down, what to do if the DB CPU is maxed out. All of this should be rehearsed in advance, with mitigation or degradation jobs and scripts ready to go

Fault review

Finally, the core part: every online incident is a big growth opportunity for a programmer, and to get the most out of it you have to review it afterwards.

The usual post-mortem trilogy

  • Fault-handling timeline: record in detail what was done and when, from the moment the fault was discovered until it was resolved, with every step along the way
  • Root-cause analysis: explain the cause of the failure in an analysis report; put bluntly, identify who is responsible for the incident
  • Follow-up action items (TODOs)

Conclusion

That wraps up the troubleshooting process. What is the most memorable failure you have run into at work, and how was it reviewed afterwards? Feel free to leave a comment. A colleague of mine tells a funny troubleshooting story from a former employer: in the middle of the night, quite a while after a system cutover and the gray-release traffic switch, the system suddenly started alarming like crazy. A row of senior people sat behind him watching him handle the fault, and he was so nervous he could not even type a SELECT statement correctly.

Thanks for reading

You can follow the WeChat official account "Roll back the code" to read new articles first; it is continuously updated and focuses on Java source code, architecture, algorithms, and interviews.