Routine troubleshooting – Call timeout

preface

The daily Bug check series is a series of simple Bug check, here I will introduce some simple Bug check tips, and also accumulate material ^_^.

At the scene of the Bug

The Bug is everyone’s favorite call timeout. So A calls B more than 5s



Searching the log, it was found that system A timed out after sending 5s. System B received the request nearly 8s later, that is, system A timed out before system B started processing it.

Start screening

So where did those five seconds go? There are three possible points:

1)A logs to the request packet that is actually sent 2) network 3)B logs to the request packet that is actually received by B.Copy the code

The network check

First, the author checked the Net Traffic of the machine at that time, and found that it was very stable, considering that it was not the pot of network.

Full GC

For Java applications, The second point of consideration should be GC, after all, Stop The World! The author searched the two corresponding systems A/B and found that system A okay, system B had Full GC at that time, and the length was 6s:

Since it has been monitored, the problem is basically the Full GC of system B, which lasts for 6s and makes system B print the request after 5s. However, this introduces a new question, why a Full GC can reach 6s size.

Why so slow?

Observing the monitoring, the author found that Full GC is sometimes fast and sometimes slow. Pull out the GC monitor log for 6s.

B System [Full GC(Metadata GC Thresold)...... (class class semantics,5.5285249 secs]...... [Times: user sys = 0.85 = 0.07 real = 6.26 secs] class unloading...Copy the code

Class class has nearly 5s. Further filter with AWK, the highest 10s, the shortest 0.1s, and they recover the same size of memory. Normal Full GC should not take so long, that 0.1s felt normal, is there something wrong with the machine at that time? With doubt, the author continued to observe the monitoring curve to see if he could find some clues, find the time point at that time, and find:



When GC is slow, the swap in of machine memory is high. Then I found another slow Full GC. Find very regular, as long as the swap in high Full GC is slow!

So the author, tried to search

https://blogs.oracle.com/poonam/long-class-unloading-pauses-with-jdk8 
Copy the code

Found, the official also found this problem, and gave an explanation.

Why swap

In fact, the memory usage of the corresponding machine is not very high, with the JVM only taking up about 4 gb of memory out of 8 GB. But the logic of swap is not to use swap only when memory is tight. If a piece of memory is not used for a long time, it may be swapped to swap.

For example, the JVM’s class information, if a class MetaData just exists and is not used. The kernel may throw it into swap. However, when GC reachability analysis is performed, the MetaData information will be accessed again, resulting in low memory utilization, but the use of swap is still slow.

But it’s hard to know exactly what’s in swap and whether it’s relevant to the JVM. So it looks like there’s a probability that GC is slow.

The other machine room is clean

Coincidentally, business development reported to me that the same application in another machine room would not have this problem. Class semantics is 0.9s or so. The author observed and found that the machine in another room did not use swap. Therefore, the author compares the kernel parameters related to swap between the two computer rooms:

GC On slow machines cat /proc/sys/vmp/swappiness 60 GC On normal machines cat /proc/sys/vmp/swappiness 1Copy the code

We found that our newly built computer room, our SA has set swappiness to 1 in advance, which means to tell the kernel not to use swap as far as possible, so that there will be no problems caused by swap.

conclusion

For non-memory bottleneck applications, we should decide whether to disable swap based on the actual situation, so as not to cause swap lag! In addition, for an occasional problem, we should find the rule through monitoring and other means, so that it is easy to find a breakthrough.