Original article from Coding diary (WeChat official account: Codelogs). Feel free to share; please keep this attribution when reposting.

Introduction

Recently our system showed a strange symptom: every few weeks it would restart in the middle of the night. The analysis dragged on for a long time and left a deep impression, so I am recording it here.

First investigation

This problem was very difficult to troubleshoot because the process's on-site state is lost after a restart: the usual jstack, jmap, arthas and similar tools cannot be used, and only machine monitoring data and logs are available.

Looking at the machine monitoring, we found that CPU and disk IO usage rose around the time of the restart. However, we quickly concluded that this information was not helpful: JVM startup itself drives CPU and disk IO up through class loading, JIT compilation and so on, so the increase was a consequence of the restart rather than the cause of the problem.

Next came the business logs and the kernel logs. After a period of analysis, the conclusions were as follows:

  1. Around the time of the restarts, system traffic did not increase noticeably.
  2. There is no trace of the oom-killer in the dmesg logs, even though it is a common cause of process restarts.
  3. Since the deployment platform can also kill processes (when they fail to respond to health-check requests), we suspected the platform might have killed the process by mistake, but the ops team did not think that was possible.

With no further progress on the problem, and no live process to inspect, I wrote a script to help with the troubleshooting. Its main logic is to monitor CPU and memory usage and automatically collect jstack and jmap output once a threshold is reached; a sketch follows.
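A minimal reconstruction of that script for illustration (not the original, which is long gone); it assumes a single java process, that the JDK tools jstat, jstack and jmap are on the PATH, and uses the 90% old-generation threshold that my script actually checked:

#!/bin/bash
# Sketch of the troubleshooting script: sample old-generation usage with jstat and
# collect jstack/jmap output once it crosses the threshold (the CPU check is omitted here).
PID=$(pgrep -o java)   # assume the oldest java process is the server
THRESHOLD=90           # old-generation usage threshold, in percent
while sleep 10; do
    # Column "O" of `jstat -gcutil` is old-generation utilisation in percent (JDK 8 layout)
    OLD=$(jstat -gcutil "$PID" | awk 'NR==2 {printf "%d", $4}')
    if [ "$OLD" -ge "$THRESHOLD" ]; then
        ts=$(date +%Y%m%d-%H%M%S)
        jstack "$PID"      > "jstack-$ts.txt"
        jmap -histo "$PID" > "jmap-histo-$ts.txt"   # collect a class histogram as well
    fi
done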

Second investigation

After the script was deployed, the process restarted again a few hours later, this time not in the middle of the night but during the day, and the investigation began again…

During this round I suddenly realized that I had missed the GC log. I quickly opened it and found the following output:

Heap after GC invocations=15036 (full 0):
 garbage-first heap   total 10485760K, used 1457203K [0x0000000540800000, 0x0000000540c05000, 0x00000007c0800000)
  region size 4096K, 9 young (36864K), 9 survivors (36864K)
 Metaspace       used 185408K, capacity 211096K, committed 214016K, reserved 1236992K
  class space    used 21493K, capacity 25808K, committed 26368K, reserved 1048576K
}
 [Times: user=0.15 sys=0.04, real=0.06 secs]
2021-03-06T09:49:25.564+0800: 914773.820: Total time for which application threads were stopped: 0.0585911 seconds, Stopping threads took: 0.0001795 seconds
2021-03-06T09:49:25.564+0800: 914773.820: [GC concurrent-string-deduplication, 7304.0B->7308.0B, avg 52.0%, 0.0000975 secs]
2021-03-06T09:50:31.544+0800: 914839.800: Total time for which application threads were stopped: 18.9777012 seconds, Stopping threads took: 18.9721235 seconds

What? Does that mean the JVM paused for 18 seconds? But the GC itself only took 0.06 seconds.

Take this line: Total time for which application threads were stopped: 18.9777012 seconds, Stopping threads took: 18.9721235 seconds. After some research:

  1. This line does indeed mean that the JVM paused for 18 seconds, the so-called STW (stop-the-world).
  2. The line appears because the JVM was started with the -XX:+PrintGCApplicationStoppedTime and -XX:+PrintGCDetails parameters (see the sketch after this list).
  3. The 18 seconds were not caused by GC. Besides GC, the JVM also has a SafePoint mechanism that can bring it into STW; operations such as jstack and jmap trigger a safepoint, forcing all threads to enter the safepoint and wait for the operation to complete.
  4. Generally speaking, if GC causes the STW, there is a [Times: user=0.15 sys=0.04, real=0.06 secs] line right above it, so the STW produced by the GC above was actually 0.0585911 seconds, rounded to 0.06 seconds.
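For reference, a minimal sketch of a startup command that produces such log lines; the heap size, log path and app.jar here are illustrative, only the two Print* flags from point 2 are essential:

# -XX:+PrintGCApplicationStoppedTime prints the "Total time for which application threads
# were stopped" lines; -XX:+PrintGCDetails and -XX:+PrintGCDateStamps produce the detailed,
# date-stamped GC records shown above
java -Xmx10g \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -XX:+PrintGCApplicationStoppedTime \
     -Xloggc:gc.log \
     -jar app.jar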

What is the SafePoint mechanism

SafePoint is a mechanism the JVM uses to pause all threads at a safe place while it performs certain operations (such as Full GC, jstack, jmap, etc.); the threads resume only after the operation finishes. The JVM plants safepoint polls at the following locations, which are the places where a thread gets a chance to stop:

  1. A safepoint poll is planted at the return of a method call.
  2. In uncounted loops, i.e. while(true)-style endless loops, a safepoint poll is planted before each backward jump of the loop.
  3. In counted loops, a loop with a long loop variable gets a safepoint poll, but a loop with an int loop variable only gets one if -XX:+UseCountedLoopSafepoints is added (see the toy experiment after this list).
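To make point 3 concrete, here is a toy experiment (file and class names are made up; it assumes a JDK 8 javac and java on the PATH). Whether a delay is actually observable depends on how the JIT compiles this particular loop, so treat it as a sketch of the idea rather than a guaranteed reproduction:

# A program whose hot loop is a counted loop (int loop variable)
cat > CountedLoop.java <<'EOF'
public class CountedLoop {
    public static void main(String[] args) {
        long sum = 0;
        // counted loop: int loop variable, so the JIT may omit the safepoint poll inside it
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            sum += i % 7;
        }
        System.out.println(sum);
    }
}
EOF
javac CountedLoop.java

# Run it with safepoint statistics, issue `jstack <pid>` from another terminal while the loop
# is running, and compare the reported spin/sync times with and without the extra flag
java -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 CountedLoop
java -XX:+UseCountedLoopSafepoints -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 CountedLoop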

After a period of investigation and thinking, I confirmed that this STW was caused by my own script! As the JVM runs longer, old-generation usage keeps climbing, and it drops again after a Full GC. But my script simply ran jmap whenever it saw old-generation usage above 90%, which triggered the JVM's safepoint mechanism, made all threads wait for jmap to finish, and left the process unable to respond to requests, so the deployment platform killed it.

In fact, the monitoring logic should have been: only treat it as a memory anomaly if old-generation usage is still high right after a Full GC (sketched below).
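A sketch of that improved check, under the same assumptions as the earlier script: compare the Full GC counter between samples and only dump when a Full GC has just finished and old-generation usage is still above the threshold:

#!/bin/bash
# Only treat it as a memory anomaly when old-gen usage stays high right after a Full GC
PID=$(pgrep -o java)
THRESHOLD=90
LAST_FGC=0
while sleep 10; do
    # $4 = old-gen utilisation (%), $9 = Full GC count, in the JDK 8 `jstat -gcutil` layout
    read -r OLD FGC < <(jstat -gcutil "$PID" | awk 'NR==2 {printf "%d %d", $4, $9}')
    if [ "$FGC" -gt "$LAST_FGC" ] && [ "$OLD" -ge "$THRESHOLD" ]; then
        jmap -histo "$PID" > "jmap-histo-$(date +%Y%m%d-%H%M%S).txt"
    fi
    LAST_FGC=$FGC
done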

After learning about safepoints, I read many articles online, which mainly mention the following five groups of JVM parameters:

# The safepoint log is printed to the JVM process's standard output
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
# If a thread takes more than 2000 ms to reach a safepoint, it is considered a safepoint timeout,
# and the JVM prints which threads failed to reach the safepoint
-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=2000
# Without this option, the JIT compiler may omit the safepoint poll at the back edge of counted for loops,
# so a thread cannot reach the safepoint until the loop finishes
-XX:+UseCountedLoopSafepoints
# In highly concurrent applications, biased locking does not improve performance but triggers many
# unnecessary safepoints; it is recommended to disable it with this option
-XX:-UseBiasedLocking
# Prevent the JVM from entering safepoints on a timer; the "no vm operation" safepoints are triggered by this timer
-XX:+UnlockDiagnosticVMOptions -XX:GuaranteedSafepointInterval=0

Note: by default the JVM prints the safepoint log to standard output. Because I use the resin server, which has a watchdog mechanism, the safepoint log ends up in the watchdog-managed file ${RESIN_HOME}/log/jvm-app-0.log.

In addition, I found many cases online of long STWs related to counted loops and biased locking, i.e. the -XX:+UseCountedLoopSafepoints and -XX:-UseBiasedLocking parameters. At the time I almost believed that adding these two parameters would solve the problem.

Instead of improving the monitoring script further, I simply dropped it and added these JVM parameters.

Safepoint log format

After adding the above JVM parameters, the safepoint log looks like the following:

$ less jvm-app-0.log
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
24.692: ForceSafepoint          [ 77 0 1 ] [ 0 0 0 0 0 ]  0
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
25.607: G1IncCollectionPause    [ 77 0 0 ] [ 0 0 0 0 418 ]  0
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
26: Deoptimize                  [ 77 0 0 ] [ 0 0 0 0 1 ]  0
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
27.136: ForceSafepoint          [ 77 0 1 ] [ 0 0 0 0 0 ]  0
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
28.137: No VM operation         [ 77 0 1 ] [ 0 0 0 0 0 ]  0

In this log:

  • The first column is the time the log entry was printed, expressed as seconds since the process started.
  • The second column is the operation that triggered the safepoint; G1IncCollectionPause, for example, is the pause caused by a G1 GC collection.
  • The third column describes the threads at the time of the STW: total is the number of threads in the JVM when the STW occurred; initially_running is the number of threads still running when the STW began (the source of the spin phase); wait_to_block is the number of threads the STW needs to block (the source of the block phase).
  • The fourth column is the time spent in each safepoint phase:

    Spin: when the JVM decides to enter a global safepoint, some threads are already at a safepoint and some are not; this phase is the time spent waiting for the threads that are not yet at a safepoint to reach one.

    Block: even after reaching the safepoint, some threads may still need to be blocked; this is the time spent waiting for them to block.

    Sync: equals spin + block + other activity time. The "Stopping threads took" value at the end of the GC STW log equals spin + block.

    Cleanup: the phase in which the JVM does some internal cleanup.

    Vmop: the time taken by the actual safepoint operation, such as G1IncCollectionPause, Deoptimize, RevokeBias, PrintThreads, GC_HeapInspection (jmap -histo) or HeapDumper (jmap -dump).
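With this format in mind, here is a quick way to hunt for long safepoints in such a log; this is a sketch, and the awk field position assumes exactly the column layout shown above:

# Drop the repeated header lines and blank lines, then keep records whose 13th field
# (the vmop time, in milliseconds) exceeds 1000
grep -Ev 'vmop|^$' jvm-app-0.log | awk '$13 + 0 > 1000 {print}'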

Third investigation

A few weeks later the problem appeared again, and this time it was the turn of the GC log and the safepoint log:

  1. The GC log contains an STW of more than 16s that was not caused by GC, as follows:
# An STW of more than 16s was found
$ less gc-*.log
Heap after GC invocations=1 (full 0):
 garbage-first heap   total 10485760K, used 21971K [0x0000000540800000, 0x0000000540c05000, 0x00000007c0800000)
  region size 4096K, 6 young (24576K), 6 survivors (24576K)
 Metaspace       used 25664K, capacity 26034K, committed 26496K, reserved 1073152K
  class space    used 3506K, capacity 3651K, committed 3712K, reserved 1048576K
}
 [Times: user=0.13 sys=0.02, real=0.04 secs]
2021-04-02T00:00:00.192+0800: 384896.192: Total time for which application threads were stopped: … seconds, Stopping threads took: 0.0000684 seconds
2021-04-02T00:00:00.193+0800: 384896.193: Total time for which application threads were stopped: 0.0006532 seconds, Stopping threads took: 0.0000582 seconds
2021-04-02T00:00:00.193+0800: 384896.193: Total time for which application threads were stopped: 0.0007572 seconds, Stopping threads took: 0.0000582 seconds
2021-04-02T00:00:00.194+0800: 384896.194: Total time for which application threads were stopped: 0.0006226 seconds, Stopping threads took: 0.0000665 seconds
2021-04-02T00:00:00.318+0800: 384896.318: Total time for which application threads were stopped: 0.1240032 seconds, Stopping threads took: 0.0000535 seconds
2021-04-02T00:00:00.442+0800: 384896.442: Total time for which application threads were stopped: 0.1240013 seconds, Stopping threads took: 0.0007532 seconds
2021-04-02T00:00:16.544+0800: 384912.544: Total time for which application threads were stopped: … seconds, Stopping threads took: 0.0000465 seconds
2021-04-02T00:00:16.545+0800: 384912.545: Total time for which application threads were stopped: 0.00232 seconds, Stopping threads took: 0.0000462 seconds
  2. The safepoint log shows a safepoint operation of about 16 seconds, triggered by a HeapWalkOperation, as follows:
# The safepoint log also shows a HeapWalkOperation of about 16s
$ less jvm-app-0.log
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
384896.193: FindDeadlocks         [ 96 0 0 ] [ 0 0 0 0 0 ]  0
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
384896.193: ForceSafepoint        [ 98 0 0 ] [ 0 0 0 0 0 ]  0
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
384896.194: ThreadDump            [ 98 0 0 ] [ 0 0 0 0 124 ]  0
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
384896.318: ThreadDump            [ 98 0 0 ] [ 0 0 0 0 124 ]  0
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
384896.442: HeapWalkOperation     [ 98 0 0 ] [ 0 0 0 0 16102 ]  0
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
384912.545: No VM operation       [ 98 0 0 ] [ 0 0 0 0 0 ]  0
  3. Comparing the timestamps in the two logs shows that they match, as follows:
# Check the start time of the processes
$ ps h -o lstart -C java | xargs -i date -d '{}' +'%F %T'
2021-03-28 13:05:03    # the watchdog process
2021-03-28 13:05:04    # the server process

# The safepoint log records time as seconds since process start, so the HeapWalkOperation at
# 384896.442 happened about 384896 seconds after startup; date can turn that into wall-clock time
$ date -d '2021-03-28 13:05:04 384896 seconds' +'%F %T'
2021-04-02 00:00:00

Notice that the STW appears in the GC log at 2021-04-02T00:00:16, while in the safepoint log it appears at 2021-04-02 00:00:00, a difference of exactly 16s, which is exactly the STW duration. This is because the GC log records the time after the STW has ended, whereas the safepoint log records the time when the STW begins, so the two logs are in fact consistent, and the 16s STW was caused by a HeapWalkOperation.

Judging by the name, it looks like some kind of heap-walking operation, similar to jmap, but my script had already been removed, so there could not have been any jmap operation, and apart from my resin server process there was no other process on the machine!

At this point part of the cause had been found, but I still did not know what triggered it, and the root cause remained out of reach…

The Nth investigation

I can no longer remember which round of investigation this was; the problem showed up several more times, but this time I finally found the root cause. The process was as follows:

  1. As before, check the GC log and the safepoint log, as follows:
# The gc log shows a 14s STW
$ less gc-*.log
2021-05-16T00:00:14.634+0800: 324412.634: Total time for which application threads were stopped: 14.1570012 seconds, Stopping threads took: 0.0000672 seconds

# The safepoint log again shows a HeapWalkOperation, this time of about 14s
$ less jvm-app-0.log
          vmop   [threads: total initially_running wait_to_block]   [time: spin block sync cleanup vmop]  page_trap_count
324398.477: HeapWalkOperation     [ 98 0 0 ] [ 0 0 0 0 14157 ]  0

The phenomenon was exactly the same as before; the key question now was what triggers the HeapWalkOperation.

  2. Having recently summarized the use of grep while learning Linux commands, I recursively grepped for heap in the resin directory, as follows:
# -i ignore case
# -r search recursively through every file under the current directory and its subdirectories
# -n print the line number of each match
$ grep -irn heap



To my surprise, this turned up a HeapDump configuration inside resin, which appeared to be part of some health-check mechanism of resin.

After studying resin's official documentation, I confirmed that resin has various health-check mechanisms. For example, at 00:00 every Sunday it generates a PDF report; the data for this report comes from jstack- and jmap-like operations, implemented by calling methods in the JDK directly. Resin's health-check documentation: www.caucho.com/resin-4.0/a…

  <health:PdfReport>
    <path>${resin.root}/doc/admin/pdf-gen.php</path>
    <report>Summary</report>
    <period>7D</period>
    <snapshot/>
    <mail-to>${email}</mail-to>
    <mail-from>${email_from}</mail-from>
    <!-- <profile-time>60s</profile-time> -->

    <health:IfCron value="0 0 * * 0"/>
  </health:PdfReport>

This mechanism generates a PDF report in the ${RESIN_HOME}/log directory, as follows:

$ ls -l
-rw-r--r-- 1 work work 2539860 2021-05-02 00:00:26 app-0-Summary-20210502T0000.pdf
-rw-rw-r-- 1 work work 3383712 2021-05-09 00:00:11 app-0-Summary-20210509T0000.pdf
-rw-rw-r-- 1 work work 1814296 2021-05-16 00:00:16 app-0-Summary-20210516T0000.pdf

Depending on the run, the pause sometimes exceeds the deployment platform's kill-process threshold and sometimes does not, so our restarts did not happen at 00:00 every single week, which is why the pattern of restarts at 00:00 went unnoticed.

So I went straight to resin's health.xml and disabled all of the health-check mechanisms, as follows:

<health:ActionSequence>
    <health:IfHealthCritical time="2m"/>

    <health:FailSafeRestart timeout="10m"/>
    <health:DumpJmx/>
    <!-- <health:DumpThreads/> -->     <!-- Comment this out -->
    <health:ScoreboardReport/>
    <!-- <health:DumpHeap/> -->        <!-- Comment this out -->
    <!-- <health:DumpHeap hprof="true" hprof-path="${resin.logDirectory}/heap.hprof"/> -->   <!-- Comment this out -->
    <health:StartProfiler active-time="2m" wait="true"/> 
    <health:Restart/>
</health:ActionSequence>

<health:MemoryTenuredHealthCheck>
    <enabled>false</enabled>            <!-- Add this -->
    <memory-free-min>1m</memory-free-min>
</health:MemoryTenuredHealthCheck>
  
<health:MemoryPermGenHealthCheck>
    <enabled>false</enabled>            <!-- Add this -->
    <memory-free-min>1m</memory-free-min>
</health:MemoryPermGenHealthCheck>
  
<health:CpuHealthCheck>
    <enabled>false</enabled>           <!-- Add this -->
    <warning-threshold>95</warning-threshold>
    <!-- <critical-threshold>99</critical-threshold> -->
</health:CpuHealthCheck>

<health:DumpThreads>
    <health:Or>
      <health:IfHealthWarning healthCheck="${cpuHealthCheck}" time="2m"/>
      <!-- <health:IfHealthEvent regexp="caucho.thread"/> -->  <!-- This event is defined by <health:AnomalyAnalyzer> -->
    </health:Or>
    <health:IfNotRecent time="15m"/>
</health:DumpThreads>

<health:JvmDeadlockHealthCheck>
    <enabled>false</enabled>            <!-- Add this -->
</health:JvmDeadlockHealthCheck>

<health:DumpJmx async="true">       <!-- Delete this node -->
    <health:OnStart/>
</health:DumpJmx>

<health:Snapshot>   <!-- Delete this node -->
    <health:OnAbnormalStop/>
</health:Snapshot>

<health:AnomalyAnalyzer>  <!-- Delete this node -->

<health:PdfReport>  <!-- Delete this node -->

With this configuration in place, there were no more restarts over the next two months, confirming that the problem was resolved.

Conclusion

This investigation did follow a certain line of reasoning, but finding the root cause still involved a fair amount of luck: a casual grep happened to turn up the decisive clue, and I do not know whether I will be that lucky next time.

Thinking about it afterwards, the more systematic way to troubleshoot this kind of problem would have been: instead of my polling script, use tracing tools such as perf or bpftrace to put a probe on the entry function of the JVM's safepoint operations, and when a safepoint operation is found to take too long, capture the JVM's native and Java call stacks at that moment with Linux tracing and profiling tools. A rough sketch of the idea follows.
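For example, a rough bpftrace sketch of that idea; it assumes bpftrace is installed, that the libjvm.so path below matches the local JDK 8 installation, and that libjvm.so still carries the SafepointSynchronize::begin/end symbols (i.e. it is not stripped):

# Print any safepoint that keeps the JVM stopped for more than 1 second
bpftrace -e '
uprobe:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so:_ZN20SafepointSynchronize5beginEv
{ @start[pid] = nsecs; }

uprobe:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so:_ZN20SafepointSynchronize3endEv
/@start[pid]/
{
    $ms = (nsecs - @start[pid]) / 1000000;
    if ($ms > 1000) { printf("safepoint in pid %d lasted %d ms\n", pid, $ms); }
    delete(@start[pid]);
}'

When such an event fires, one could then grab the JVM's native and Java stacks with a profiler to see what the safepoint operation was doing.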
