1. Background

Yesterday afternoon we suddenly received an alert e-mail from ops: CPU utilization on the data platform server had reached 98.94% and had stayed above 70% for some time. At first glance this looks like a hardware bottleneck that calls for more capacity. On reflection, though, our business system is neither high-concurrency nor CPU-intensive, so this utilization is excessive. The hardware limit should not have been reached so soon, and the more likely cause is a problem somewhere in the business code logic.

2. Troubleshooting approach

2.1 Locating the PID of the heavily loaded process

Log in to the server and run the top command to check the server status.

Comparing the load average with the usual load criteria for an 8-core machine confirms that the server is heavily loaded.
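The load check above can be sketched as a small helper. This is only an illustration: the threshold of 1.0 per core is an assumed rule of thumb, not a figure from the alert, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: judge a 1-minute load average against the number of CPU cores.
# The 1.0-per-core threshold is an assumed rule of thumb.

# load_per_core AVG NCPU -> prints AVG / NCPU with two decimals
load_per_core() {
    awk -v avg="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", avg / ncpu }'
}

# is_overloaded AVG NCPU -> exit status 0 when load per core exceeds 1.0
is_overloaded() {
    awk -v avg="$1" -v ncpu="$2" 'BEGIN { exit !(avg / ncpu > 1.0) }'
}

# On Linux the live values could be read like this
# (assumes /proc/loadavg and nproc are available):
#   avg=$(cut -d ' ' -f 1 /proc/loadavg)
#   ncpu=$(nproc)
#   is_overloaded "$avg" "$ncpu" && echo "high load: $(load_per_core "$avg" "$ncpu") per core"
```

For example, a load average of 12 on 8 cores gives 1.50 per core, which this sketch treats as overloaded.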

By observing the resource usage of each process, you can see that process 682 has a high CPU usage.

2.2 Locating the abnormal service

Here we can use the pwdx command to find the working directory of the business process from its PID, and from that identify the project and the person responsible.

It can be concluded that this process corresponds to the Web service of the data platform.
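A minimal sketch of this lookup follows. pwdx prints a line of the form "PID: /path"; the helper names and the example path /srv/example-app are hypothetical, since the real directory is not given in the text.

```shell
#!/bin/sh
# Sketch: turn pwdx output ("PID: /path") into just the path.

# strip_pid_prefix: reads pwdx-style lines on stdin, prints only the directory
strip_pid_prefix() {
    sed 's/^[0-9]*: //'
}

# process_cwd PID -> working directory of the process (requires pwdx from procps)
process_cwd() {
    pwdx "$1" | strip_pid_prefix
}

# Usage on the server, with the PID found via top:
#   process_cwd 682
```

The directory name usually reveals which project the process belongs to, which is enough to find its owner.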

2.3 Locating abnormal threads and specific code lines

The traditional approach usually takes four steps:

1. top, sorted by CPU usage (press P): 1040 // Find the PID of the most heavily loaded process

2. top -Hp <process PID>: 1073 // Find the ID of the heavily loaded thread

3. printf "0x%x\n" <thread ID>: 0x431 // Convert the thread ID to hexadecimal for searching the jstack output later

4. jstack <process PID> | vim +/<hex thread ID> -, for example: jstack 1040 | vim +/0x431 -
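Steps 3 and 4 can be wrapped in two small helpers, sketched below. The function names are hypothetical, grep stands in for the vim search so the result can be piped or logged, and the 20 lines of context is an arbitrary choice. The thread ID is still expected to come from top -Hp.

```shell
#!/bin/sh
# Sketch: convert a thread ID to hex and pull its entry out of a jstack dump.

# to_hex_tid TID -> thread ID in the hexadecimal form shown in jstack's nid= field
to_hex_tid() {
    printf '0x%x\n' "$1"
}

# stack_of_thread PID TID -> the thread's header line plus 20 lines of stack
# (requires jstack from the JDK; the trailing space avoids matching longer IDs)
stack_of_thread() {
    jstack "$1" | grep -A 20 "nid=$(to_hex_tid "$2") "
}

# Usage with the example numbers above:
#   to_hex_tid 1073            # prints 0x431
#   stack_of_thread 1040 1073
```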

oldratlee of Taobao has packaged the above process into a tool, show-busy-java-threads.sh, which makes it easy to locate such problems in production.

Its output shows that a method in one of the system's time utility classes is consuming a relatively large share of CPU. Once the specific method has been located, check whether its code logic has a performance problem.

※ If the production problem is urgent, you can skip 2.1 and 2.2 and go straight to 2.3. The analysis from several angles here is only meant to present a complete line of reasoning.

3. Root cause analysis

The analysis and troubleshooting above finally traced the high server load and CPU usage to a time utility class.

1. The problem method: it converts a timestamp into the corresponding date and time format.

2. The caller one level up: it enumerates every second from midnight to the current time, converts each one into that format, puts the results into a set, and returns the set.

3. The logic layer: this is the query logic behind the data platform's real-time report. Report queries arrive at a fixed interval, and a single query calls the method multiple (n) times.

So if the time is 10:00 a.m., a single query performs 10 * 60 * 60 * n = 36,000 * n conversions, and the count keeps growing linearly through the day, approaching 86,400 * n just before midnight. Because the real-time query and real-time alarm modules issue a large number of requests, each of which calls this method many times, a great deal of CPU is occupied and wasted.
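The growth in work per query can be checked with a few lines of shell arithmetic. This is only an illustration of the formula above; the function name is hypothetical and n is whatever number of calls a query makes.

```shell
#!/bin/sh
# Sketch: number of timestamp conversions one query triggers at a given time of day.

# conversions_per_query HOUR MINUTE SECOND N
# Pass plain decimal numbers (no leading zeros, which the shell would read as octal).
conversions_per_query() {
    echo $(( ($1 * 3600 + $2 * 60 + $3) * $4 ))
}

conversions_per_query 10 0 0 1      # 36000
conversions_per_query 10 0 0 5      # 180000
conversions_per_query 23 59 59 1    # 86399
```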

4. Solutions

With the problem located, the first idea was to reduce the number of calculations by optimizing the problem method. Inspection showed that the logic layer never used the contents of the set returned by this method, only its size. After confirming this, we replaced the call with a new method that computes the size directly (the current time in seconds minus the time in seconds at midnight that day), which eliminated the excess computation. After the release we watched the server load and CPU usage: compared with the abnormal period, the load fell by a factor of 30 and returned to normal. With that, the problem was solved.
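The difference between the two approaches can be sketched as follows. This is a shell illustration rather than the service's actual code, which the article does not show: the function names are hypothetical, the date formatting of each entry is left out, and whether the original set included the end point is not stated.

```shell
#!/bin/sh
# Sketch: two ways to get the number of seconds between midnight and now,
# given both as epoch seconds.

# size_by_enumeration MIDNIGHT NOW -> emit one entry per second, then count them.
# The work grows with the time elapsed since midnight.
size_by_enumeration() {
    t=$1
    while [ "$t" -lt "$2" ]; do
        echo "$t"
        t=$(( t + 1 ))
    done | wc -l | tr -d ' '
}

# size_by_subtraction MIDNIGHT NOW -> the same number from a single subtraction
size_by_subtraction() {
    echo $(( $2 - $1 ))
}

# Both give the same answer for a 90-second window:
size_by_enumeration 1700000000 1700000090
size_by_subtraction 1700000000 1700000090
```

Since only the size is needed, the subtraction does constant work per call no matter how late in the day the query runs.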

5. Summary

1. When coding, we should pay attention to performance as well as to business logic. Implementing a requirement, and implementing it efficiently and elegantly, reflect two quite different levels of engineering ability, and the latter is an engineer's core competitive advantage.

2. Once the code is written, review it more than once and ask whether it could be implemented in a better way.

3. Don't let any small detail of a production problem slip by! The devil is in the details. Engineers need curiosity and a drive for excellence; only then can they keep growing and improving. (Author: Zhang Yecheng)

This article is reprinted from the public account “Python Operations circle”