Introduction

After working through the JVM material in the preceding nine chapters, we now have a comprehensive picture of the overall JVM knowledge system. While the previous chapters were mostly theoretical, this chapter looks at the JVM in action. With that theoretical foundation behind us, we can quickly locate faults and performance bottlenecks in the online environment and solve the “problems” we encounter much faster.

Online troubleshooting and performance optimization are “regulars” of the interview process. When an “incurable disease” shows up online, we analyze it with rational thinking, troubleshoot it, locate the problem, and solve it; and once the problem or bottleneck is resolved, we can also, within our ability and as the environment allows, try to find an optimal solution and give due thought to future expansion.

This chapter begins with an overview of online troubleshooting ideas, then surveys the troubleshooting tools commonly used with the JVM, and finally gives a comprehensive analysis of some of the most common online failures the JVM runs into.

I. Common faults and troubleshooting ideas in the JVM online environment

During development, a variety of local visualization tools are available to inspect the JVM whenever problems occur. However, a program written in the development environment will sooner or later be deployed to a production server, and the online environment is prone to occasional unexpected problems, such as the following JVM issues:

  • ① JVM memory leaks.
  • ② JVM memory overflow.
  • ③ Business threads are deadlocked.
  • ④ The application goes down unexpectedly.
  • ⑤ Threads block / respond slowly.
  • ⑥ CPU utilization spikes or hits 100%.

When a failure occurs in the online environment, things are unlike development: we usually cannot attach visual tools and debuggers, and conditions online are often much “worse”. So how should we handle such problems? First of all, we must have a sound troubleshooting approach, and then, building on theoretical knowledge, solve the problems one by one with the support of experience and data analysis.

1.1 Online troubleshooting and the problem-solving approach

Relatively speaking, whether it is a fault or a performance bottleneck, the approach is generally the same, and the steps are as follows:

  • ① Analyze the problem: based on theoretical knowledge plus experience, judge where the problem is likely to be or what is likely causing it, narrowing the target down to a certain range.
  • ② Troubleshoot: based on the previous step, investigate the “suspects” from most to least likely, ruling some out to narrow the target range further.
  • ③ Locate the problem: with the help of relevant monitoring data, pin the cause down to a precise position in a more fine-grained way.
  • ④ Solve the problem: once the specific location and cause are determined, take the appropriate measures to fix it.
  • ⑤ Try the optimal solution (optional): after the original problem is solved, give due consideration, within your ability and as the environment allows, to the optimal solution (from the perspectives of performance, scalability, concurrency, etc.).

Of course, the process above is for hard problems and experienced developers. As “new-age application builders”, we must also learn to use tools properly to help us solve problems quickly:

  • ① Extract or copy the key fragment of the error.
  • ② Open Baidu or Google, paste it in, and search.
  • ③ Scan the returned results and pick the entries whose title and description match your problem.
  • ④ Read a few of them and try to solve the problem following their solutions.
  • ⑤ If one works, everyone is happy; if none works after trying, “find someone / ask in a group”.
  • ⑥ If “external forces” cannot solve it either, troubleshoot it yourself following the earlier steps.

After all, when an interviewer asks how you solved a problem, you can’t say you relied on Baidu. It is also worth remembering that “the information you can search for was also written by someone, so why can’t you be the person who wrote it?”

1.2 Directions of online troubleshooting

Generally speaking, when a fault occurs after the system is deployed online, analysis and troubleshooting usually trace the root cause to one of the following:

  • Problems caused by the application itself
    • GC is triggered frequently inside the program, causing long system pauses and a flood of client requests.
    • JVM parameters are configured improperly, so the program runs out of control online, e.g. the heap or individual memory areas are too small.
    • The Java code is defective, producing online bugs such as deadlocks, memory leaks, memory overflows, and so on.
    • Resources are used improperly inside the program, causing problems with threads, DB connections, network connections, off-heap memory, etc.
  • Problems caused by upstream and downstream internal systems
    • A traffic surge in an upstream service sharply increases the number of requests reaching the current program and drags the system down.
    • A downstream service fails, so requests pile up in the current program and drag the system down, e.g. Redis is down or the DB is blocked.
  • Problems caused by the machine the program is deployed on
    • A network fault occurs in the server’s machine room, the network is blocked, and the current program appears dead.
    • The server becomes unavailable because of other programs, hardware problems, or environmental factors (such as a power failure).
    • The server is compromised, and the Java program is affected by Trojans, mining scripts, or hijacking scripts.
  • Problems caused by remote RPC calls with third parties
    • As the callee of a third party, a surge in the third party’s traffic overloads the current program.
    • As the caller of a third party, the current program crashes because the third party has a problem, causing an avalanche.

However much the symptoms vary, they stay within these broad directions. Not every possible problem is listed above, but in general, when a problem occurs it falls into one of these big categories: first position the problem roughly, then gradually deduce its specific location, and then solve it.

II. Program monitoring and performance tuning tools provided by Java

When a problem is encountered, the first thing to do is locate it, and locating a problem is driven by data, such as program run logs, exception stack traces, GC logs, thread snapshot files, heap memory snapshot files, and so on. Collecting that data relies on monitoring tools, so when problems occur in an online JVM it is natural to reach for the tools provided by the JDK and by third parties, such as jps, jstat, jstack, jmap, jhat, hprof, jinfo, Arthas, and so on. Let’s take a look at each of these tools.

jps, jstat, jstack, jmap, jhat, jinfo, and similar commands are tools that ship with the JDK. Their functions are mainly implemented by calling Java methods in the %JAVA_HOME%/lib/tools.jar package, so if you want to build your own JVM monitoring system, you can call the methods in that jar from your own Java program. If you do not know how to use one of these tools, run it with the -help parameter, e.g. jps -help.
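As a quick illustration of that idea, here is a minimal sketch (assuming JDK 8 with tools.jar on the classpath; on JDK 9+ the same API lives in the jdk.attach module) that lists the local JVMs, then attaches to the process whose PID is passed as the first argument and prints its system properties:

import com.sun.tools.attach.VirtualMachine;
import com.sun.tools.attach.VirtualMachineDescriptor;

public class AttachDemo {
    public static void main(String[] args) throws Exception {
        // List every JVM running on this machine, similar to what jps shows
        for (VirtualMachineDescriptor vmd : VirtualMachine.list()) {
            System.out.println(vmd.id() + "\t" + vmd.displayName());
        }
        // Attach to the target process and read its system properties, then detach
        VirtualMachine vm = VirtualMachine.attach(args[0]);
        try {
            System.out.println(vm.getSystemProperties());
        } finally {
            vm.detach();
        }
    }
}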

PS: If you already know the tools provided by the JDK, you can skip ahead to Part III.

2.1 Process monitoring tool – jps

The main purpose of the jps tool is to list the Java processes running on the machine, similar to the Linux command ps -aux | grep java. The jps tool also supports viewing Java processes on other machines. The command format is as follows:

jps [options] [hostid]    (view usage with jps -help)

The options are -q, -m, -l, -v, and -V:

  • jps -q: list all running Java processes on the machine, showing only the process IDs (lvmid).
  • jps -m: in addition, display the arguments passed to the main method.
  • jps -l: in addition, display the fully qualified name of the program’s main class, or the full path of the jar package being run.
  • jps -v: in addition, display the JVM parameters explicitly specified at startup.
  • jps -V: in addition, display the main class name or jar package name.

hostid is the identifier of the remote host to connect to when viewing Java processes on another machine.

In practice, the jps tool is usually invoked simply as jps, optionally combined with the options above (e.g. jps -l).

2.2 Configuration information viewing tool – jinfo

The jinfo tool is mainly used to view the JVM’s runtime parameters in real time, and it can also adjust some parameters dynamically while the program is running. The command format is as follows:

jinfo [option1] [option2]    (view usage with jinfo -help / jinfo -h)

The options of [option1] are as follows:

  • <no option>: by default, print all JVM parameters and system properties.
  • -flag <name>: output the value of the parameter named <name>.
  • -flag [+|-]<name>: enable or disable the parameter named <name>.
  • -flag <name>=<value>: set the parameter named <name> to the given <value>.
  • -flags: output all parameters of the JVM.
  • -sysprops: output all system properties of the JVM.

The options of [option2] are as follows:

  • <pid>: the process ID of the JVM (required); specifies the Java process jinfo operates on.
  • executable <core>: the core file from which to print the stack trace.
  • [server-id@]<remote server IP or hostname>: the address for remote operation.
    • server-id: the process ID of the remote debug service;
    • remote server IP/hostname: the host name or IP address of the remote debug service;

In practice, the jinfo tool is typically used as jinfo -flags [pid].

PS: The effect of each option is not demonstrated here; interested readers can start a local Java process and try the options above for themselves.

2.3 Statistics monitoring tool – jstat

jstat is the Java Virtual Machine statistics monitoring tool. It uses the JVM’s built-in counters to monitor a Java program’s resources and performance in real time. Its monitoring scope covers the data areas of the heap space, garbage collection status, and class loading/unloading status.

Command format: jstat -<option> [-t] [-h<lines>] <vmid> [<interval> [<count>]]

Each parameter is interpreted as follows:

  • [option]: the monitoring option.
  • -t: add a Timestamp column to the output showing how long the system has been running.
  • -h: re-print the table header every given number of rows when data is output periodically.
  • vmid: the Virtual Machine ID, i.e. the ID of the Java process to monitor.
  • interval: the interval between samples; the default unit is ms.
  • count: how many samples to output; by default, output continues indefinitely.

The [option] parameter of jstat has many possible values:

  • -class: output class-loading related information.
  • -compiler: display JIT just-in-time compilation information.
  • -gc: display GC-related information.
  • -gccapacity: display the capacity and usage of each generation’s space.
  • -gcmetacapacity: display metaspace-related information.
  • -gcnew: display young-generation information.
  • -gcnewcapacity: display the capacity and usage of the young-generation space.
  • -gcold: display old-generation information.
  • -gcoldcapacity: display the capacity and usage of the old-generation space.
  • -gcutil: display garbage collection statistics.
  • -gccause: same as -gcutil, but additionally outputs the cause of the last (or current) GC.
  • -printcompilation: print information about JIT-compiled methods.

So jstat is actually used as follows:

jstat -gc -t -h30 9895 1s 300
  -gc: monitor the GC status. -t: show the running time. -h30: re-print the table header every 30 rows. 9895: the target process ID. 1s: sample every second. 300: take 300 samples in total.

The final execution result is as follows:



The meanings of the fields in the statistics column are as follows:

Field – Definition
Timestamp – how long the system has been running
S0C – total capacity of the first Survivor region
S1C – total capacity of the second Survivor region
S0U – used size of the first Survivor region
S1U – used size of the second Survivor region
EC – total capacity of the Eden region
EU – used size of the Eden region
OC – total capacity of the Old region
OU – used size of the Old region
MC – total capacity of the Metaspace region
MU – used size of the Metaspace region
CCSC – total size of the CompressedClassSpace
CCSU – used size of the CompressedClassSpace
YGC – number of young-generation GCs from program start to sampling time
YGCT – total young-generation GC time from program start to sampling time
FGC – number of old-generation (Full) GCs from program start to sampling time
FGCT – total old-generation GC time from program start to sampling time
GCT – total GC time from program start to sampling time

If [options] specifies other options, different columns will appear as follows:

Field – Definition
S0 – utilization of the first Survivor region (S0U/S0C)
S1 – utilization of the second Survivor region (S1U/S1C)
E – utilization of the Eden region (EU/EC)
O – utilization of the Old region (OU/OC)
M – utilization of the Metaspace region (MU/MC)
CCS – utilization of the CompressedClassSpace (CCSU/CCSC)
NGCMN – initial capacity of the young-generation space
NGCMX – maximum capacity of the young-generation space
S0CMN – initial capacity of the first Survivor region
S0CMX – maximum capacity of the first Survivor region
S1CMN – initial capacity of the second Survivor region
S1CMX – maximum capacity of the second Survivor region
OGCMN – initial capacity of the old-generation space
OGCMX – maximum capacity of the old-generation space
MCMN – initial capacity of the metaspace
MCMX – maximum capacity of the metaspace
CCSMN – initial capacity of the compressed class space
CCSMX – maximum capacity of the compressed class space
TT – minimum age threshold for object promotion (tenuring threshold)
MTT – maximum age threshold for object promotion
DSS – desired total size of the Survivor regions

CCS is the space used to store compressed class pointers.

In addition to the heap-space and GC statistics, the jstat tool can also monitor class loading/unloading and JIT compilation. Run the jstat -class [pid] and jstat -compiler [pid] commands; the effect is as follows:



The statistics columns related to class loading and unloading are interpreted as follows:

Field – Definition
Loaded – number of classes the JVM has loaded
Bytes – size in bytes occupied by the loaded classes
Unloaded – number of classes that have been unloaded
Bytes – size in bytes occupied by the unloaded classes
Time – total time spent loading and unloading classes

The statistics columns related to JIT just-in-time compilation are interpreted as follows:

Field – Definition
Compiled – total number of compilation tasks executed
Failed – number of compilation tasks that failed
Invalid – number of compilation tasks that were invalidated
Time – total time taken by all compilation tasks
FailedType – type of the last compilation task that failed
FailedMethod – class and method of the last compilation task that failed

The meanings of the statistics columns produced by the different jstat commands are explained above. If you use jstat to troubleshoot performance bottlenecks online and come across a column you don’t understand, refer back to these definitions.

2.4 Heap memory statistics and analysis tool – jmap

jmap is a versatile tool for viewing heap-space usage; it can generate dump files of the Java heap and is usually used together with the jhat tool. Beyond that, it can also show the finalize queue, details of the metaspace, and object statistics in the Java heap, such as the usage of each region and the currently configured GC collector.

Command format: jmap [option1] [option2]

Where [option1] can be:

  • [no option]: display the process’s memory mappings, similar to Solaris pmap.
  • -heap: display detailed information about the Java heap space.
  • -histo[:live]: display statistics about objects in the Java heap.
  • -clstats: display class-loader related information.
  • -finalizerinfo: display the objects waiting in the F-Queue for the Finalizer thread to execute their finalize method.
  • -dump:<dump-options>: generate a heap dump snapshot.
  • -F: when a normal -dump or -histo fails, adding -F forces execution.
  • -help: display the help information.
  • -J<flag>: pass the given JVM parameters to the JVM that runs jmap.

[option2] is similar to jinfo. The options are as follows:

  • <pid>: the process ID of the JVM (required); specifies the Java process jmap operates on.
  • executable <core>: the core file from which to print the stack trace.
  • [server-id@]<remote server IP or hostname>: the address for remote operation.
    • server-id: the process ID of the remote debug service;
    • remote server IP/hostname: the host name or IP address of the remote debug service;

In practice, jmap is typically used as jmap -clstats [pid] or, to export a heap snapshot, jmap -dump:live,format=b,file=dump.hprof [pid], where: live exports only the live objects in the heap; format specifies the output format; file specifies the output file name and extension (such as .dat or .hprof).

As before, the specific effect of each option is not demonstrated here; you can try them yourself and observe the output.
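As a side note, a heap dump can also be triggered from inside the application itself through the HotSpotDiagnosticMXBean, which exposes roughly the same capability as jmap -dump; a minimal sketch (the output path is just an example):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diagnostic =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live = true dumps only reachable (live) objects, analogous to jmap -dump:live
        diagnostic.dumpHeap("/tmp/manual_dump.hprof", true);
        System.out.println("Heap dump written to /tmp/manual_dump.hprof");
    }
}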

It is worth mentioning that most of the tools provided by the JDK communicate with the JVM through the attach mechanism, which can perform operations on the target JVM process such as memory dumps, thread dumps, class statistics, dynamically loading agents, dynamically setting JVM parameters, and printing JVM parameters and system properties. If you are interested, you can study it further; the relevant source code lives in the com.sun.tools.attach package, which contains the code related to the attach mechanism.

Finally, let’s do a quick demonstration of the histo option. It displays statistics about objects in the heap space, including the number of instances and the memory they occupy. Because histo:live performs a Full GC first, the :live form counts only surviving objects; therefore the total reported without :live is larger than with :live (since :live forces a Full GC before counting), as follows:



In the figure above, class name is the type of each object, but some entries are abbreviations. The abbreviations map to real types as follows:

Abbreviation – Real type
B – byte
C – char
D – double
F – float
I – int
J – long
Z – boolean
[ – an array
L<type> – other object types

2.5 Heap memory snapshot analysis tool – jhat

The jhat tool is used together with jmap to analyze the dump files jmap exports. jhat has a miniature HTTP/HTML server built in, so after analyzing a dump file it lets you view the results in a browser. However, jhat is generally not used to parse dump files directly in the online environment, because parsing a dump file, especially a large one, is time-consuming and consumes a lot of hardware resources; to avoid occupying too many server resources, dump files are usually copied to another machine or to a local machine for analysis.

Even locally, jhat is rarely used, because the results it generates are clumsy to read through a browser; tools such as MAT (Eclipse Memory Analyzer), IBM HeapAnalyzer, VisualVM, and JProfiler are generally chosen instead.

jhat command format: jhat [-stack <bool>] [-refs <bool>] [-port <port>] [-baseline <file>] [-debug <int>] [-version] [-h|-help] <file>



The jhat command is a bit long and has many optional parameters, interpreted as follows:

  • -stack: default true; enable tracking of the object allocation call stack.
  • -refs: default true; enable tracking of references to objects.
  • -port: default 7000; the port on which jhat’s browser interface is served.
  • -baseline: specify a baseline dump file; objects that appear in both dump files are marked as old objects and differing objects are marked as new objects. This is mainly used to compare two different dump files.
  • -debug: default 0; set the debug level. 0 means no debugging information; larger values mean more detailed output.
  • -version: display the version information.
  • -help: view the help information.
  • <file>: the dump file to analyze.
  • -J<flag>: the jhat tool actually starts a JVM to run in, and -J passes JVM parameters to that JVM, e.g. -J-Xmx128m.

In practice, jhat is used as jhat HeapDump.dat. The results are as follows:

> jmap -dump:live,format=b,file=HeapDump.dat 7452
Dumping heap to HeapDump.dat ...
Heap dump file created

> jhat HeapDump.dat
Reading from HeapDump.dat...
Dump file created Wed Mar 09 14:50:06 CST 2022
Snapshot read, resolving...
Resolving 7818 objects...
Chasing references, expect 1 dots.
Eliminating duplicate references.
Snapshot resolved.
Started HTTP server on port 7000
Server is ready.

In the process above, we first export a dump file of the Java heap with the jmap utility, then analyze the exported dump file with the jhat tool. After the analysis is complete, open a browser and go to http://localhost:7000 to view the results jhat generated, as follows:



There are a number of options, from top to bottom:

  • ① View the object instances of each class, grouped by package path.
  • ② View the collection of all Roots nodes in the heap.
  • ③ View the instance counts of all classes (including the JVM’s own classes).
  • ④ View the instance counts of all classes (excluding the JVM’s own classes).
  • ⑤ View a histogram of the instance objects in the Java heap (roughly the same as jmap’s object statistics).
  • ⑥ View the JVM’s finalizer-related information.
  • ⑦ Use the OQL object query language provided by jhat to query instance information for a specified object.
    • The OQL syntax can be viewed directly at http://localhost:7000/oqlhelp.

In essence, the browser interface jhat provides is not easy to troubleshoot with, so in practice more intuitive and convenient tools such as MAT, JConsole, IBM HeapAnalyzer, and VisualVM are usually used to analyze heap dump files.

2.6 Stack trace tool – jstack

The jstack tool is primarily used to capture a snapshot of the JVM’s threads at the current moment, i.e. the stack of methods each thread in the JVM is executing. Online, generating a thread snapshot file helps locate the cause of long thread pauses, such as deadlocks, endless loops, or external resources that never respond to requests.

When a thread pauses, you can use jstack to generate a thread snapshot and view the call stack of every thread in the Java program; from the call stacks you can clearly see what the paused thread is doing and what resources it is waiting for. Likewise, when a Java program crashes, if the relevant parameters were configured and a core file was generated, jstack can also extract the Java virtual machine stack information from that core file to help locate the cause of the crash.

jstack command format: jstack [-F] [option1] [option2]

[option1] can be:

  • -l: in addition to the stack information, output extra lock-related information (used to troubleshoot deadlock problems).
  • -m: if a thread calls a native method, also display the C/C++ stack frames of the native method stack.

The options of [option2] are as follows:

  • <pid>: the process ID of the JVM (required); specifies the Java process jstack operates on.
  • executable <core>: the core file from which to print the stack trace.
  • [server-id@]<remote server IP or hostname>: the address for remote operation.
    • server-id: the process ID of the remote debug service;
    • remote server IP/hostname: the host name or IP address of the remote debug service;

The actual usage of jstack is jstack -l [pid].

In addition, the -F parameter of jstack works like jmap’s: when normal execution fails, -F forces the jstack command to run.

Finally, the thread states that appear in the dump logs exported by jstack are worth knowing:

State – Meaning
Deadlock – the thread is deadlocked
Runnable – the thread is executing
Waiting on condition – the thread is waiting for a resource
Waiting on monitor entry – the thread is waiting to acquire a monitor lock
Suspended – the thread is suspended
Object.wait() or TIMED_WAITING – the thread is suspended waiting
Blocked – the thread is blocked
Parked – the thread is parked

2.7 Summary of JVM troubleshooting tools

The analysis tools above all ship with the JDK; each has its own role and can monitor the JVM’s runtime health from a different dimension, helping us quickly locate and troubleshoot problems in the online environment. Beyond the JDK tools there are also many very handy third-party tools, such as Arthas, JProfiler, Perfino, YourKit, Perf4J, JProbe, MAT, JConsole, and VisualVM, which are often more convenient and more powerful than the JDK tools discussed above.

III. A “grand collection” of JVM online faults and hands-on troubleshooting

After a program goes online, running into unexpected situations is undoubtedly a headache. As a qualified developer, being able to type out fluent code is not enough; the ability to troubleshoot errors online is extra important. That ability depends largely on the richness of your experience: rich practical experience and a solid theoretical reserve, plus a rational troubleshooting mindset, matter most in online troubleshooting.

Next, let’s take a thorough look at the failures that occur most frequently in online environments: JVM memory leaks, memory overflows, business-thread deadlocks, application outages, thread blocking / slow response, and CPU utilization spiking or hitting 100%.

3.1 The “eve” of online troubleshooting

During troubleshooting, the cause of a problem may also lie in upstream or downstream systems, so when a problem occurs, first locate the node that is at fault and then troubleshoot that node. Whichever node it is (the Java application, the DB, upstream or downstream Java systems, etc.), the causes fall into a few areas: code, CPU, disk, memory, and network. So when you hit an online problem, make reasonable use of the tools provided by the OS and the JVM (such as df, free, top, jstack, jmap, ps, etc.) and check these aspects one by one.

One caveat: most of the tools the JVM provides affect performance to some degree, so if your application is deployed on a single machine, it is best to migrate traffic away (change DNS, Nginx configuration, etc.) before troubleshooting. If your program is deployed as a cluster, one machine can be isolated to preserve the scene and make debugging easier. At the same time, if the online machine can no longer provide normal service, the first thing to do before troubleshooting is to “stop the bleeding” in time: roll back the version, degrade the service, or restart the application so the current node returns to normal service.

3.2 JVM Memory Overflow (OOM)

To understand memory overflow:

For example, a wooden bucket can hold only 40L of water; if you pour 50L into it, the extra water overflows over the top of the bucket. In a program, this situation is called a memory overflow (out of memory).

Memory overflow (OOM) is one of the most common problems to troubleshoot online. Java memory is divided into many areas, such as the heap space, metaspace, and stack space; for details, see the earlier chapter on the JVM runtime data areas. In general, there are three main reasons for memory overflows in online environments:

  • ① The memory allocated to the JVM is too small to support the data growth of normal program execution.
  • ② The Java program has internal problems or bugs, so the GC reclamation rate cannot keep up with the allocation rate.
  • ③ Your own code, or an imported third-party dependency, leaks memory, leaving insufficient memory available.

Problems ② and ③ are both OOMs caused by loosely written Java code: a large number of garbage objects pile up in memory, so no free memory is left to allocate for new objects, and an overflow results.

When troubleshooting an OOM, the core questions are: where did the OOM happen? why did it happen? and how can it be avoided? The troubleshooting process should be driven by data analysis, that is, by dump data. You can obtain a heap dump file in either of two ways: ① configure -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath at startup, so a dump file is exported automatically when an OOM occurs; ② restart the program and, after it has run for a while, export the dump with a tool such as jmap or a third-party tool.

3.2.1 Hands-on: troubleshooting an online Java OOM

The simulated cases are as follows:

// JVM startup parameters: -Xms64M -Xmx64M -XX:+HeapDumpOnOutOfMemoryError
// -XX:HeapDumpPath=/usr/local/java/java_code/java_log/Heap_OOM.hprof
import java.util.ArrayList;
import java.util.List;

public class OOM {
    // Test object class for the memory overflow
    public static class OomObject{}

    public static void main(String[] args){
        List<OomObject> OOMlist = new ArrayList<>();
        // Infinite loop: repeatedly add object instances to the collection
        for(;;){
            OOMlist.add(new OomObject());
        }
    }
}

On Linux, start the above Java program in the background:

[root@localhost ~]# java -Xms64M -Xmx64M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/java/java_code/java_log/Heap_OOM.hprof OOM &
[1] 78645

When the OOM occurs, a dump file is generated under /usr/local/java/java_code/java_log/. Load that file directly into Eclipse MAT (Memory Analyzer Tool), which will automatically analyze the cause of the OOM for you; then improve the corresponding code based on its analysis.

In this case, you will receive an OOM exception message after you run it:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3210)
	at java.util.Arrays.copyOf(Arrays.java:3181)
	at java.util.ArrayList.grow(ArrayList.java:261)
	at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
	at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
	at java.util.ArrayList.add(ArrayList.java:458)
	at OOM.main(OOM.java:13)

This particular case is not of much reference value, and the same is true of most other people’s OOM write-ups, because troubleshooting an OOM only requires rational thinking and the steps are roughly the same. So the key is to clarify the OOM troubleshooting approach, which comes next.


The OOM troubleshooting approach:

  • ① First obtain the dump file. It is best to configure the dump parameters when deploying online so the original scene is preserved; if they were not configured, you can shrink the heap and add the parameters when restarting the program, so the “scene” can be reproduced.
  • ② If you cannot obtain a dump file automatically at OOM time by setting parameters, let the program run online for a while, coordinate with testers to stress each interface, and then actively export a heap dump with a tool such as jmap (this is not as good as a dump file the program exports automatically).
  • ③ Transfer the dump file to a local machine, then analyze it with a dump-analysis tool, such as the JDK’s jvisualvm or a third-party tool such as MAT.
  • ④ Based on the analysis results, try to locate the problem. First determine the area where it occurred, e.g. whether the overflow is in off-heap memory or in the heap, and if in the heap, in which data area. Once the overflowing area is determined, analyze the cause of the overflow (common OOM causes are listed below).
  • ⑤ Based on the located area and cause, apply the corresponding fix, e.g. optimize the code, optimize the SQL, and so on.

3.2.2 Summary of online memory overflow problems

Among the online problems a Java program needs troubleshooting for, memory overflow is absolutely a “frequent visitor”. Usually an OOM is caused by code problems; patterns in a program that easily cause OOM include:

  • ① Loading too much external data into memory at once, e.g. reading a whole DB table or a local report file (see the sketch after this list).
  • ② Using containers (Map/List/Set, etc.) in the program without cleaning them up in time, so memory stays tight and the GC cannot reclaim it.
  • ③ Dead loops or very large loops in the program logic, or a single loop producing a large number of duplicate object instances.
  • ④ Bugs in the third-party dependencies introduced into the program that cause memory problems.
  • ⑤ A memory leak in the program that keeps eating the available memory, which the GC cannot reclaim.
  • ⑥ Third-party dependencies loading a huge number of classes, so the metaspace cannot hold all the class metadata and an OOM results.
  • ⑦ …
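To make cause ① a little more concrete, here is a minimal, hypothetical sketch (the file path and the process() method are placeholders) that streams a large report line by line instead of pulling the whole thing into the heap at once:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReportProcessor {
    public static void main(String[] args) throws Exception {
        // Risky pattern: Files.readAllLines(...) would load the entire report into the heap in one go.
        // Safer pattern: stream the file so only one line needs to live in memory at a time.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("/data/report.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                process(line); // handle each record individually
            }
        }
    }

    private static void process(String line) {
        // placeholder for the real business logic
    }
}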

These are the ways code in a program causes OOM. When you run into them online, locate the offending code, fix it, and release again. Besides code-induced OOMs, sometimes the memory allocated to the program is simply too small; that case is the easiest to solve: just reallocate more memory space.

In a Java program, the heap space, metaspace, stack space, and other areas can all run into OOM. Metaspace overflows are mostly caused by allocating too little space for it, although a “misbehaving class library” causing OOM cannot be ruled out. A true stack-space OOM is almost never seen online. So in the actual online environment, heap-space OOM is the most common: almost all OOM problems that need troubleshooting are heap-space overflows.

3.3 JVM memory leaks

To understand memory leaks:

For example: a wooden bucket can hold 40L of water, but if I throw a 2KG gold bar into it, from then on the bucket can hold at most 38L. When this happens to a program’s memory, it is called a memory leak. PS: ignore the density of the object; don’t nitpick the example!

The concepts of memory leak and memory overflow are easily confused, but they are fundamentally different problems: a memory overflow may be caused by a memory leak, but a memory leak is never caused by an OOM.

There are two main types of memory leaks in online Java programs:

  • ① In-heap leaks: memory leaks caused by improper code, e.g. garbage objects still being referenced from static objects, external connections not being closed properly, etc.
  • ② Off-heap leaks: memory requested for buffers not being released, data in direct memory not being cleared manually, etc.

A memory leak is hard to detect directly during online inspection, because unless you monitor how objects change in the heap, the leak itself is difficult to see. As a result, leaks in online environments are usually discovered together with an OOM problem:

While troubleshooting an OOM fault, you find that a memory leak has been continuously eating up the available memory, so no free memory is left when new objects are allocated, which results in the memory overflow.

3.3.1 JVM memory leak detection

An example simulating a memory leak:

// JVM startup parameters: -Xms64M -Xmx64M -XX:+HeapDumpOnOutOfMemoryError
// -XX:HeapDumpPath=/usr/local/java/java_code/java_log/Heap_MemoryLeak.hprof
// Without the heap limit, it would take a long time for this memory leak to cause an OOM.
import java.util.ArrayList;
import java.util.List;

public class MemoryLeak {
    // Long-lived object: a static root node
    static List<Object> ROOT = new ArrayList<>();

    public static void main(String[] args) {
        // Keep creating new objects without manually removing them from the container
        for (int i = 0; i <= 999999999; i++) {
            Object obj = new Object();
            ROOT.add(obj);
            obj = i; // reassigning the local variable does not release the object already added to ROOT
        }
    }
}

Start the program:

[root@localhost ~]# java -Xms64M -Xmx64M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/java/java_code/java_log/Heap_MemoryLeak.hprof MemoryLeak &
[1] 78849

After waiting for a while, the following exception message is displayed:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3210)
	at java.util.Arrays.copyOf(Arrays.java:3181)
	at java.util.ArrayList.grow(ArrayList.java:261)
	at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
	at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
	at java.util.ArrayList.add(ArrayList.java:458)
	at MemoryLeak.main(MemoryLeak.java:14)

At first glance this looks like the OOM we analyzed earlier, but it is not. In theory, when memory runs out the GC threads should reclaim the objects being created; however, these objects are all referenced from the static ROOT member, and static members are treated by the JVM as GC Roots. So every object created remains reachable through ROOT even after it has been used, and the GC cannot reclaim these “useless” objects.

So although the program’s output looks like a memory overflow, the real cause is a memory leak.

Of course, the case above is only a simple simulation of a memory leak; real development scenarios can be much more complex, for example:

During some business logic an object gets attached to a static member, is used once, and is never needed again; but because its link to the static member is never manually severed, the memory occupied by this “abandoned object” is never reclaimed by the GC.

Therefore, when an object must be attached to a static object but will clearly never be used again after one use, you should manually null out or remove the reference between the object and the static node. In the case above, the loop’s last step should be ROOT.remove(obj).
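A minimal sketch of that fix, based on the MemoryLeak example above (useOnce() is a placeholder for the one-time business use of the object):

import java.util.ArrayList;
import java.util.List;

public class MemoryLeakFixed {
    // The static root still exists, but objects no longer stay reachable from it forever
    static List<Object> ROOT = new ArrayList<>();

    public static void main(String[] args) {
        for (int i = 0; i <= 999999999; i++) {
            Object obj = new Object();
            ROOT.add(obj);
            useOnce(obj);
            ROOT.remove(obj); // sever the link to the GC root once the object is no longer needed
        }
    }

    private static void useOnce(Object obj) {
        // placeholder for the single business use of the object
    }
}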

3.3.2 Summary of online memory leak problems

If you encounter an OOM caused by a memory leak, first confirm whether it is an in-heap leak or an off-heap leak; after all, both the heap space and the metaspace can harbor a potential leak. Pinning down where the leak lives lets you get twice the result with half the effort.

Common examples of memory leaks:
  • ① Temporary external connection objects not being closed properly after use, e.g. DB connections, Socket connections, file IO streams.
  • ② Newly created objects attached to long-lived objects not being cleaned up or detached in time after use, so the reference keeps existing and the GC cannot reclaim them, e.g. objects associated with static objects or singletons.
  • ③ Off-heap memory not being manually released or cleared after being requested, e.g. native memory allocated through the “magic class” Unsafe not being freed, or a Buffer not being cleared after use.
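For cause ③, the key point is that native memory allocated through Unsafe is invisible to the GC, so every allocateMemory call must be paired with a freeMemory call. A minimal JDK 8 sketch of that pairing (reflection is needed because the theUnsafe field is not public, and this is illustrative rather than recommended production code):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapDemo {
    public static void main(String[] args) throws Exception {
        // Obtain the Unsafe instance via reflection (JDK 8; not a public API)
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        Unsafe unsafe = (Unsafe) field.get(null);

        long address = unsafe.allocateMemory(1024); // 1KB of native (off-heap) memory
        try {
            unsafe.putLong(address, 42L);
            System.out.println(unsafe.getLong(address));
        } finally {
            // Without this call the native memory would leak: the GC never touches it
            unsafe.freeMemory(address);
        }
    }
}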

There is also a common misconception about memory leaks; don’t be misled by it:

“In Java, multiple non-root objects that reference each other and stay alive form a reference cycle, which prevents the GC mechanism from reclaiming the memory those objects occupy and causes a memory leak.”

This may sound reasonable, but it does not hold in Java. Because Java’s GC uses the reachability-analysis algorithm, objects that are unreachable from the GC roots are judged to be garbage and are reclaimed uniformly. Therefore, even if a reference cycle exists in the heap, it does not cause a memory leak.

3.4 Business threads are deadlocked

Deadlock refers to the phenomenon in which two or more threads (or processes) wait on each other because of resource contention during execution. Without outside intervention the waiting never ends, and none of them can continue executing. Here’s an example:

One day in the forest, Panda and Bamboo find a toy bow and arrow, and both want to play with it. They agree to take turns: Bamboo plays first, but wants another go, so he keeps the bow in his hand even though it is now Panda’s turn; Panda, meanwhile, has run off to fetch the arrow that was just shot. When Panda comes back, the following happens. Panda says: “Bamboo, give me the bow in your hand, it’s my turn to play…” Bamboo says: “No, you give me the arrow in your hand first, I’ll play one more time and then hand it over…” In the end, Panda waits for Bamboo’s bow and Bamboo waits for Panda’s arrow; neither side backs down, and a stalemate results…

When this happens in a program, it is called a deadlock. Once it occurs, only outside intervention can break the deadlocked state and let the program continue executing, just as in the story a third party would have to step in and hand the bow in Bamboo’s hand to Panda to break the “deadlock”.

3.4.1 Hands-on: online deadlock troubleshooting

Here’s a simple example to get a feel for a deadlock:

public class DeadLock implements Runnable {
    public boolean flag = true;
    // Two static lock objects
    private static Object o1 = new Object(), o2 = new Object();

    public DeadLock(boolean flag) {
        this.flag = flag;
    }

    @Override
    public void run() {
        if (flag) {
            synchronized (o1) {
                System.out.println("Thread: " + Thread.currentThread().getName() + " holds o1...");
                try {
                    Thread.sleep(500);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                System.out.println("Thread: " + Thread.currentThread().getName() + " waits for o2...");
                synchronized (o2) {
                    System.out.println("true");
                }
            }
        }
        if (!flag) {
            synchronized (o2) {
                System.out.println("Thread: " + Thread.currentThread().getName() + " holds o2...");
                try {
                    Thread.sleep(500);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                System.out.println("Thread: " + Thread.currentThread().getName() + " waits for o1...");
                synchronized (o1) {
                    System.out.println("false");
                }
            }
        }
    }

    public static void main(String[] args) {
        Thread t1 = new Thread(new DeadLock(true), "T1");
        Thread t2 = new Thread(new DeadLock(false), "T2");
        // t1.start() is called before t2.start(), but that does not guarantee t1 runs before t2
        t1.start();
        t2.start();
    }
}

This code constructs a simple deadlock:

  • When flag == true, the thread first acquires the lock on object o1 and sleeps for 500ms after acquiring it. This will be t1, because in the main method t1’s task has flag explicitly set to true.
  • While t1 is sleeping, the t2 thread starts. t2’s task has flag = false, so it acquires the lock on object o2 and then sleeps for 500ms after acquiring it.
  • Then t1 wakes up and continues executing; it now needs the lock on o2, but o2 is held by t2, so t1 blocks and waits.
  • At the same moment t2 also wakes up and continues executing; it now needs the lock on o1, but o1 is held by t1, so t2 blocks and waits.
  • In the end, t1 and t2 each wait for the resource held by the other before they can continue, and a deadlock results.

The result is as follows:

D:\> javac -encoding UTF-8 DeadLock.java
D:\> java DeadLock
Thread: T1 holds o1....
Thread: T2 holds o2....
Thread: T2 waits for o1....
Thread: T1 waits for o2....

In the case above, t1 can never acquire o2 and t2 can never acquire o1, so a deadlock occurs. But what if, online, we don’t know where the deadlock is happening? There are several ways to locate the problem:

  • ① Troubleshoot with the jps + jstack tools.
  • ② Troubleshoot with the jconsole tool.
  • ③ Troubleshoot with the jvisualvm tool.

Of course, you can also use other third-party tools to troubleshoot the problem, but the three above ship with the JDK. Since most Java programs are deployed on Linux servers, the two visual tools are not convenient there, so in the online environment the first approach, jps + jstack, is used most often.
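Besides the external tools above, the JDK’s ThreadMXBean can also detect deadlocks from inside the application itself (for example on a periodic health-check thread); a minimal sketch:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {
    public static void main(String[] args) {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        // Returns the IDs of threads deadlocked on monitors or ownable synchronizers, or null if none
        long[] deadlockedIds = threadMXBean.findDeadlockedThreads();
        if (deadlockedIds == null) {
            System.out.println("No deadlock detected");
            return;
        }
        for (ThreadInfo info : threadMXBean.getThreadInfo(deadlockedIds)) {
            System.out.println("Deadlocked thread: " + info.getThreadName()
                    + ", waiting for: " + info.getLockName()
                    + ", held by: " + info.getLockOwnerName());
        }
    }
}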


Next, let’s use jps + jstack to troubleshoot the deadlock. Keep the original cmd/shell window open, open a new window, and enter the jps command:

D:\> jps
19552 Jps
2892 DeadLock

jps displays the running Java processes and their process IDs. The output shows that the process with ID 2892 is the Java program that produced the deadlock, so we can use the jstack tool to view that process’s dump log as follows:

D:\> jstack -l 2892

The following information is displayed:

It is clear from the dump log that jstack detected a deadlock in this process, caused by the threads named T1 and T2, and that the deadlock arises at lines DeadLock.java:41 and DeadLock.java:25. At this point the deadlock has been located, so we can follow the code to find the problem and optimize it to make sure the deadlock never happens again.

PS: to save some effort, this deadlock walkthrough is copied from the earlier article on threads, processes, fibers, coroutines, monitors, and deadlock, livelock, and lock starvation. The environment there was Windows, but the steps on Linux are exactly the same.

3.4.2 Deadlock problem summary

Deadlocks in Java programs are usually caused by non-standard code, so when troubleshooting a deadlock you need to locate the specific code that causes it, improve it, and redeploy.
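For the DeadLock example above, one common fix is to make every thread acquire the two monitors in the same global order, so the circular wait can never form; a minimal sketch:

public class OrderedLock implements Runnable {
    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    @Override
    public void run() {
        // Every thread takes LOCK_A before LOCK_B, so no two threads can
        // each hold one lock while waiting for the other
        synchronized (LOCK_A) {
            synchronized (LOCK_B) {
                System.out.println(Thread.currentThread().getName() + " finished its work");
            }
        }
    }

    public static void main(String[] args) {
        new Thread(new OrderedLock(), "T1").start();
        new Thread(new OrderedLock(), "T2").start();
    }
}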

3.5 The application goes down unexpectedly

It is not uncommon for a Java application to go down after being deployed online, and many factors can cause it: the machine-room environment, server hardware problems, avalanches triggered by other upstream or downstream nodes, the Java application itself (frequent GC, OOM, being crushed by traffic, etc.), or Trojans and mining scripts planted on the server can all bring the program down unexpectedly. Because the cause is uncertain, this kind of outage is usually resolved jointly by development, operations, and security staff; what we need to ensure is that when it happens the program restarts immediately and the operations staff are notified promptly to help investigate. For that, Keepalived is a good solution.

3.5.1 Hands-on: handling online Java application downtime

Keepalived is a good tool for hot standby and high availability (install it yourself). Its main functions are executing scripts periodically, sending mail to a specified address on failure, and floating the virtual IP to another host when a machine goes down. Here we mainly use its alerting and periodic script-execution features.

After installing Keepalived, edit the keepalived.conf file with vi and change its monitoring-script configuration block to the following:

vrrp_script chk_nginx_pid {
    # the script to run: automatically restarts the service
    script "/usr/local/src/scripts/check_java_pid.sh"
    interval 4      # check interval (4 seconds)
    weight -20      # if the condition holds, reduce the weight by 20
}

The check_java_pid.sh script is as follows:

java_count=`ps -C java --no-header | wc -l`
if [ $java_count -eq 0 ]; then
    java /usr/local/java_code/HelloWorld &
    sleep 1
    # this part is used for VIP drift (optional here)
    if [ `ps -C java --no-header | wc -l` -eq 0 ]; then
        /usr/local/src/keepalived/etc/rc.d/init.d/keepalived stop
    fi
fi

The HelloWorld.java file is as follows:

public class HelloWorld{
    public static void main(String[] args){
       System.out.println("hello,Java!");
       for(;;){}
    }
}

Once the environment is set up, you can test the effect by launching the HelloWorld Java application:

#Start the Java application
[root@localhost ~]# java /usr/local/java_code/HelloWorld

#View the Java process
[root@localhost ~]# ps aux | grep java
root  69992  0.1  0.7 153884  7968 ?      Ss  16:36  0:21 java
root  73835  0.0  0.0 112728   972 pts/0  S+  16:37  0:00 grep --color=auto java

Then grant the script execute permission and launch Keepalived:

#Enable script execution permission (as root, this step can be omitted)
[root@localhost ~]# chmod +x /usr/local/src/scripts/check_java_pid.sh

#Go to the Keepalived installation directory and start the Keepalived application
[root@localhost ~]# cd /usr/local/src/keepalived/
[root@localhost keepalived]# keepalived-1.2.22/bin/keepalived etc/keepalived
#Check out the Keepalived daemon
[root@localhost keepalived]# ps aux | grep keepalived
root  73908  0.0  0.1  42872  1044 ?  Ss  17:01  0:00 keepalived
root  73909  0.0  0.1  42872  1900 ?  S   17:01  0:00 keepalived
root  73910  0.0  0.1  42872  1272 ?  S   17:01  0:00 keepalived

With everything above running, now simulate a Java application outage by manually killing the Java process, as follows:

#kill -9 69992: forcibly kill the Java process (69992 is the PID of the Java application started earlier)
[root@localhost ~]# kill -9 69992

#Query the Java process (there is no Java process at this point, because it has just been killed)
[root@localhost ~]# ps aux | grep java
root  76621  0.0  0.0 112728   972 pts/0  S+  17:03  0:00 grep --color=auto java

#Query the Java process again about three seconds later
[root@localhost ~]# ps aux | grep java
root  79921  0.1  0.7 153884  7968 ?      Ss  17:08  0:21 java
root  80014  0.0  0.0 112728   972 pts/0  S+  17:08  0:00 grep --color=auto java

A few seconds later, you will find that the background Java application has been “resurrected”.

3.5.2 Summary of online Java application downtime

Keepalived is a handy tool; you can also configure its email alert service to send mail to specific addresses when a problem or a restart occurs. But restarting is only a palliative: to truly solve the downtime problem you still have to start from the root and eliminate the cause of the crashes.

3.6 Threads block / respond slowly

Slow response and thread blocking are closely related. When threads in a Java service block because something unexpected happens during execution, the direct symptom seen by clients is a slow response, so thread blocking inevitably makes clients slow or unresponsive; on the other hand, thread blocking is not the only cause of slow responses. Like an application outage, slow response is a “compound” problem: all threads in the Java application blocking, too many TCP connections, SQL statements taking too long, tight disk/CPU/memory resources on the machine, excessive traffic, exceptions in third-party middleware or upstream interfaces, static resources not being offloaded, and so on can all slow the response down. Identifying this kind of problem is therefore also largely a matter of experience. Still, there are rules of thumb when troubleshooting no-response or slow-response problems:

  • ① The whole system responds slowly: if the program as a whole is too slow, the cause is excessive load, an abnormal downstream system, a problem in the current Java application, a problem on the current machine (network/hardware/environment), a problem in an upstream system, and so on. In other words, only a failure or overload at some entire layer of the application system makes the program slow across the board.
  • ② A single interface responds slowly: if one interface (or one class of interfaces) is too slow while the others respond normally, then it is undoubtedly caused by something inside that interface’s implementation, such as a slow SQL query, too much data being queried, a problematic call to a third-party interface, incorrect internal logic that blocks threads, a thread deadlock, and so on.

The two cases above can be understood as the difference between “point” and “surface”: one is system-wide, the other single-point. Besides scope, slow responses can also be divided by when they occur: persistent, intermittent, or sporadic. There are many possible causes, so when you meet this situation online, analyze the cause rationally and then optimize at the appropriate level for the specific case, e.g. multithreaded execution, asynchronous callback notification, introducing caching middleware, MQ peak shaving, read/write splitting, separating static and dynamic content, cluster deployment, adding a search engine, and so on; all of these can be understood as ways of optimizing response speed.
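As one concrete example of the single-interface case, a call to a slow third-party dependency can be bounded with a timeout so that it cannot pin request threads indefinitely; a minimal sketch (the sleeping task stands in for the real remote call, and the pool size and timeout values are illustrative):

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ThirdPartyCallWithTimeout {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

    public static String callThirdParty() {
        Future<String> future = POOL.submit(() -> {
            Thread.sleep(5000); // stands in for a slow HTTP/RPC call
            return "remote result";
        });
        try {
            // Bound the wait so one slow dependency cannot block the caller forever
            return future.get(800, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);      // give up on the slow call
            return "fallback result"; // degrade gracefully instead of hanging
        } catch (InterruptedException | ExecutionException e) {
            return "fallback result";
        }
    }

    public static void main(String[] args) {
        System.out.println(callThirdParty());
        POOL.shutdown();
    }
}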

3.7 CPU utilization stays high or spikes to 100%

CPU at 100% and OOM are both common Java interview topics. A CPU spike to 100% is actually a relatively simple online problem, because the scope is already determined: it can only occur on the machine the program runs on, so there is no need to work out where the problem lies. You just need to locate the specific process causing the CPU spike on that machine, then troubleshoot and fix it.

3.7.1 Hands-on: troubleshooting 100% CPU online

The simulated Java case code is as follows:

public class CpuOverload {
    public static void main(String[] args) {
        // Start ten sleeping threads (simulating inactive threads)
        for (int i = 1; i <= 10; i++) {
            new Thread(() -> {
                System.out.println(Thread.currentThread().getName());
                try {
                    Thread.sleep(10 * 60 * 1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }, "InactivityThread-" + i).start();
        }

        // Start a thread that loops forever (simulating the thread that causes the CPU spike)
        new Thread(() -> {
            int i = 0;
            for (;;) i++;
        }, "ActiveThread-Hot").start();
    }
}

First, open a new shell/SSH window and launch the Java application to simulate a CPU surge:

[root@localhost ~]# javac CpuOverload.java
[root@localhost ~]# java CpuOverload

Then, in another window, use the top command to view the status of the system’s background processes:

[root@localhost ~]# top
top - 14:09:20 up 2 days, 16 min,  3 users,  load average: 0.45, 0.15, 0.11
Tasks:  98 total,   1 running,  97 sleeping,   0 stopped,   0 zombie
%Cpu(s): 100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :   997956 total,   286560 free,   126120 used,   585276 buff/cache
KiB Swap:  2097148 total,  2096372 free,      776 used.   626532 avail Mem

  PID USER  PR  NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
77915 root  20   0 2249432 25708 11592 S  99.9  2.6  0:28.32 java
  636 root  20   0  298936  6188  4836 S   0.3  0.6  3:39.52 vmtoolsd
    1 root  20   0   46032  5956  3492 S   0.0  0.6  0:04.27 systemd
    2 root  20   0       0     0     0 S   0.0  0.0  0:00.07 kthreadd
    3 root  20   0       0     0     0 S   0.0  0.0  0:04.21 ksoftirqd/0
    5 root   0 -20       0     0     0 S   0.0  0.0  0:00.00 kworker/0:0H
    7 root  rt   0       0     0     0 S   0.0  0.0  0:00.00 migration/0
    8 root  20   0       0     0     0 S   0.0  0.0  0:00.00 rcu_bh
    9 root  20   0       0     0     0 S   0.0  0.0  0:11.97 rcu_sched
......

As you can see from the results above, the CPU usage of the Java process with PID 77915 reaches 99.9%, so we can be sure the machine’s CPU spike is caused by this Java application.

In this case, you can run the top -Hp [pid] command to view the thread with the highest CPU usage inside that Java process:

[root@localhost ~]# top -Hp 77915
......omitting system resource information......
  PID USER  PR  NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
77935 root  20   0 2249432 26496 11560 R  99.9  2.7  3:43.95 java
77915 root  20   0 2249432 26496 11560 S   0.0  2.7  0:00.00 java
77916 root  20   0 2249432 26496 11560 S   0.0  2.7  0:00.08 java
77917 root  20   0 2249432 26496 11560 S   0.0  2.7  0:00.00 java
77918 root  20   0 2249432 26496 11560 S   0.0  2.7  0:00.00 java
77919 root  20   0 2249432 26496 11560 S   0.0  2.7  0:00.00 java
77920 root  20   0 2249432 26496 11560 S   0.0  2.7  0:00.00 java
77921 root  20   0 2249432 26496 11560 S   0.0  2.7  0:00.01 java
......

The top -Hp 77915 output shows that the other threads are sleeping and not holding the CPU, while the thread whose PID is 77935 is using 99.9% of the CPU.

At this point the “culprit” behind the CPU spike has emerged. Next, convert that thread’s PID to a hexadecimal value, which makes it easier to search the log information later:

[root@localhost ~]# printf %x 77935
1306f

Now we have the hexadecimal ID of the “culprit”. Next, use the jstack tool to view the thread’s stack information, searching by the hexadecimal thread ID:

[root@localhost ~]# jstack 77915 | grep 1306f
"ActiveThread-Hot" #18 prio=5 os_prio=0 tid=0x00007f7444107800 nid=0x1306f runnable [0x00007f7432ade000]

This shows that the thread whose ID is 1306f is ActiveThread-Hot. Alternatively, you can export the full thread stack and inspect the details in the log, as follows:

[root@localhost ~]# jstack 77915 > java_log/thread_stack.log
[root@localhost ~]# vi java_log/thread_stack.log
------------- then press / and enter the thread ID: 1306f -------------
"ActiveThread-Hot" #18 prio=5 os_prio=0 tid=0x00007f7444107800 nid=0x1306f runnable [0x00007f7432ade000]
   java.lang.Thread.State: RUNNABLE
        at CpuOverload.lambda$main$1(CpuOverload.java:18)
        at CpuOverload$$Lambda$2/531885035.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:745)

The thread stack log lists in detail the thread’s name, its state, and which line of code is consuming the CPU. The next thing to do is fix the code at the located position and redeploy the Java application.

Of course, if jstack 77915 | grep 1306f had instead returned something like "VM Thread" os_prio=0 tid=0x00007f871806e000 nid=0xa runnable, then the culprit is one of the JVM’s own threads, and you would need to look further into those, such as the GC threads, the compilation threads, and so on.

3.7.2 Summary of 100% CPU troubleshooting

The steps for troubleshooting CPU at 100% follow an almost fixed template:

  • ① Use the top command to check the resource usage of background processes and determine whether the spike is caused by a Java application.
  • ② Use top -Hp [pid] to find the threads with the highest CPU usage inside that Java application.
    • Note the PID of the thread with the highest CPU usage and convert it to hexadecimal.
  • ③ Use the jstack tool to export the Java application’s thread stack snapshot.
  • ④ Search the thread stack with the hexadecimal thread ID obtained above and locate the specific code causing the CPU spike.
  • ⑤ Check whether the thread causing the spike is a VM thread of the JVM or a business thread.
  • ⑥ If it is a business thread, it is a code problem: fix the code according to the stack information and redeploy the program.
  • ⑦ If it is a VM thread, the spike may be caused by frequent GC, frequent compilation, or other JVM activity, and needs further investigation.

Problems such as CPU spikes generally have only a few causes: ① problems in the business code, such as an infinite loop or very deep recursion; ② too many threads created in the Java application, causing frequent context switches that consume CPU; ③ the JVM’s own threads running frequently, e.g. frequent GC or frequent compilation.

3.8 A brief note on other online problems

The preceding sections elaborated on a variety of online problems and their solutions, but there are many other online “problems” too, such as disk utilization hitting 100%, DNS hijacking, database ransom attacks, Trojan intrusions, mining-script implants, network failures, and so on. The skills and tools for handling them are accumulated through experience, a “valuable asset” developers should build up at work.

IV. Online troubleshooting summary

Online troubleshooting relies heavily on experience: the more such problems a developer has faced, the more comfortable they are in handling them. With enough troubleshooting experience, even problems never seen before can be investigated methodically rather than leaving you at a loss.

All in all, there is no cookie-cutter method that covers every kind of online problem; rich experience + powerful tools + rational thinking is the way to deal with them. Still, the overall approach does not change, and the steps are roughly the same, as mentioned at the beginning:

Analyze the problem, troubleshoot the problem, locate the problem, solve the problem, try the optimal solution.