Classification of problems:

  1. CPU problems

  2. Memory problems (GC issues, memory leaks, OOM, coredumps, etc.)

  3. I/O problems

Troubleshooting Kit:

System-level tools:

  1. top: an essential tool for viewing the CPU, memory, and swap usage of the system and of individual processes.

  2. pmap: analyzes the memory layout of a process.

  3. strace: traces the system calls made and the signals received by a running process; for example, it can be used to trace how a process requests memory from the operating system.

  4. gperftools: a performance-analysis toolkit for memory-leak detection, CPU profiling, and more.

  5. GDB: a powerful command-line debugger, essential for C/C++ development (the JVM itself is implemented in C/C++).

  6. iostat: dynamically monitors the system's disk I/O activity.

  7. iotop: similar to top in appearance, but monitors per-process and per-thread disk I/O usage on Linux.

  8. vmstat: monitors the operating system's virtual memory, processes, and CPU activity in real time.

  9. netstat: shows various kinds of network-related information, such as network connections, the routing table, and interface statistics.

  10. dstat: a versatile tool that can replace vmstat, iostat, netstat, and ifstat.

Java-level tools:

  1. jps: lists Java processes.

  2. jstat: monitors JVM statistics such as GC activity and class loading.

  3. jinfo: views the configuration parameters of a running Java process.

  4. jstack: dumps the stack traces and current states of all threads in a JVM.

  5. jcmd: a multi-purpose tool for dumping the heap, listing Java processes, exporting thread information, triggering GC, sampling, and more.

  6. jmap: inspects the heap memory usage of a JVM process and dumps the heap; the go-to tool for locating heap-memory problems, often used together with MAT.

  7. VJmap: a generational version of jmap developed by Vipshop; it only works with the CMS collector.

  8. BTrace: dynamically traces the runtime details of an application without restarting the service.

  9. Arthas: a very powerful Java diagnostic tool that can trace code dynamically and monitor JVM state in real time.

  10. MAT (Memory Analyzer): analyzes JVM heap dumps (memory and threads) and produces memory-analysis reports.

  11. GCLogViewer: analyzes trends in GC logs, useful for detecting memory leaks and comparing logs.

  12. JProfiler: a Java performance-bottleneck analysis tool with especially strong CPU, thread, and memory analysis; indispensable for performance tuning.

Common problem analysis steps:

CPU problems:

Troubleshooting approach: in general, run top to check whether the CPU load is high and find the process (usually a Java process) with the highest CPU usage. Then run "top -H -p <pid>" to find the threads that consume the most CPU, and finally use jstack to see what those threads are doing: convert the thread ID to hexadecimal and match it against the nid field in the jstack output. In addition, look at whether the CPU is being consumed in kernel mode or in user mode: high kernel-mode usage points to context switching, locking, and I/O, while high user-mode usage points to computation, loops, GC, and similar issues.

Example: CPU context switch leads to service avalanche.
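
To make the workflow concrete, here is a minimal, hypothetical sketch (not taken from the example above) of the kind of code it typically surfaces: a runaway loop that appears as a RUNNABLE thread pinned near 100% CPU in top -H, and whose stack shows up under the matching nid in the jstack dump.

```java
// Hypothetical sketch: a runaway thread of the kind "top -H -p <pid>" + jstack surfaces.
public class BusySpin {
    public static void main(String[] args) throws Exception {
        Thread hot = new Thread(() -> {
            long counter = 0;
            while (true) {      // never blocks or sleeps: pure user-mode computation
                counter++;
            }
        }, "hot-loop-thread");
        hot.start();
        // To map it back: take the decimal thread ID reported by "top -H -p <pid>",
        // convert it to hexadecimal, and find the matching "nid=0x..." entry in the
        // jstack output; its stack trace points straight at this loop.
        hot.join();
    }
}
```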

Memory problems:

Coredump:

In most cases a crashing process leaves behind a coredump file; where it is written is configured by /proc/sys/kernel/core_pattern. In addition, the JVM itself generates a crash report (the hs_err_pid<pid>.log file), which gives a rough picture of the state at the time of the crash, but the coredump file contains more information. You can debug the coredump with GDB to find the cause.

Example: How to locate the Java program coredump
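
As a hedged illustration (hypothetical code, assuming JDK 8-style access to sun.misc.Unsafe): an invalid native memory write like the one below crashes the JVM with SIGSEGV rather than a Java exception, producing an hs_err_pid<pid>.log and, if core dumps are enabled via ulimit -c and core_pattern, a coredump file that can then be opened with GDB.

```java
import java.lang.reflect.Field;

import sun.misc.Unsafe;

// Hypothetical sketch: deliberately crash the JVM with an invalid native write.
// Unlike a Java exception, this kills the process with SIGSEGV and leaves behind
// an hs_err_pid<pid>.log plus (if enabled) a coredump for GDB.
public class CrashDemo {
    public static void main(String[] args) throws Exception {
        Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
        theUnsafe.setAccessible(true);
        Unsafe unsafe = (Unsafe) theUnsafe.get(null);
        unsafe.putLong(0L, 42L);   // write to address 0 -> segmentation fault
    }
}
```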

Out of Memory:

Cases where the JVM runs out of memory:

  • Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread – there is not enough memory to allocate the stack for a new Java thread.

    Workaround: a lot of articles claim you can fix this by tuning -Xss, but in practice operating systems use lazy allocation: they do not reserve -Xss worth of physical memory for each thread up front, but allocate it on demand.

    At least 95% of the time this is caused by forgetting to call shutdown() on an ExecutorService (see the sketch after this list), and occasionally by system limits such as the maximum thread count or max user processes being set too small.

  • Exception in thread "main" java.lang.OutOfMemoryError: Java heap space – heap usage has reached the maximum set by -Xmx.

    Solution: increase -Xmx, or, if there is a memory leak, follow the heap-memory-leak approach described below.

  • Caused by: java.lang.OutOfMemoryError: PermGen space – permanent-generation usage has reached the maximum set by -XX:MaxPermSize.

    Solution: increase -XX:MaxPermSize.
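
A hedged sketch of the ExecutorService case mentioned above (hypothetical code, not from the article): every call creates a new pool whose worker threads are never released, so the process accumulates native threads until thread creation fails with the error above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: an ExecutorService created per request and never shut down.
// Each pool keeps its worker threads alive, so the process accumulates native
// threads until "unable to create new native thread" is thrown.
public class ThreadLeak {
    // Buggy version: a new pool per call, never shut down.
    static void handleRequestLeaky(Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        pool.submit(task);
        // missing: pool.shutdown();
    }

    // Fixed version: reuse one shared pool (or call shutdown() when done with a pool).
    private static final ExecutorService SHARED = Executors.newFixedThreadPool(10);

    static void handleRequest(Runnable task) {
        SHARED.submit(task);
    }
}
```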

StackOverflow:

  • Exception in thread "main" java.lang.StackOverflowError – the thread's stack needs more memory than the -Xss value allows.

    Solution: increase -Xss, and check for unbounded recursion (see the sketch below).
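
A minimal, hypothetical sketch of the usual culprit: recursion with no reachable termination condition, where each call adds a stack frame until the -Xss limit is hit.

```java
// Hypothetical sketch: unbounded recursion exhausting the thread stack.
// Run with e.g. -Xss256k to see the error sooner; a larger -Xss only delays it
// if the recursion itself is the bug.
public class StackOverflowDemo {
    static long depth = 0;

    static void recurse() {
        depth++;        // each call adds a stack frame
        recurse();      // no termination condition -> java.lang.StackOverflowError
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError at depth " + depth);
        }
    }
}
```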

Heap memory leak:

  1. Check whether GC behavior is normal; heap memory leaks are almost always accompanied by abnormal GC behavior.

  2. Dump the heap with jmap -dump:live,format=b,file=mem.map <pid>.

  3. Use Memory Analyzer (MAT) to analyze the dump, inspect objects and their reference chains, and find the objects that cannot be reclaimed.
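
As a complement to step 2 (a hedged sketch, not part of the original steps): the same kind of live heap dump can also be triggered from inside the JVM through the HotSpotDiagnosticMXBean, and the resulting .hprof file is opened in MAT just like a jmap dump.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hedged sketch: programmatically produce the same kind of heap dump that
// "jmap -dump:live,..." creates, for analysis in MAT.
public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean mxBean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // "true" dumps only live (reachable) objects, matching jmap's "live" option.
        mxBean.dumpHeap("mem.hprof", true);
        System.out.println("heap dumped to mem.hprof; open it with MAT");
    }
}
```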

Off-heap memory leak:

Off-heap memory is usually allocated in one of two ways: through Unsafe or ByteBuffer (direct buffers), or directly by native code. A typical user of the Unsafe/ByteBuffer path is Netty; a typical user of the native path is ZipFile. Ninety percent of the off-heap leaks I have encountered were related to these two scenarios, but there are others as well, such as allocating off-heap memory through JavaCPP (which allocates natively under the hood). For off-heap memory leaks, the combination of gperftools and BTrace can usually handle the problem; if not, you may need a lower-level tool such as strace.

In addition, although off-heap memory itself is not in the heap, the objects that reference it are; so when it is inconvenient or inconclusive to examine the off-heap side directly, you can check whether there are suspicious referencing objects in the heap.

Example: an “off-heap memory leak” caused by Spring Boot
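
A hedged sketch of the direct-buffer case (hypothetical code, not from the articles above): each buffer is a tiny heap object that pins a large native allocation, so a growing collection of them leaks off-heap memory while heap usage looks healthy; in MAT the clue is the pile of retained java.nio.DirectByteBuffer instances.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: direct buffers kept reachable forever. Each one is a small
// heap object that owns a large off-heap allocation (bounded by -XX:MaxDirectMemorySize),
// so native memory grows while the heap itself stays small.
public class OffHeapRetention {
    private static final List<ByteBuffer> RETAINED = new ArrayList<>();

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            RETAINED.add(ByteBuffer.allocateDirect(1024 * 1024)); // 1 MB off-heap each
        }
        System.out.println("retained " + RETAINED.size() + " direct buffers (~100 MB off-heap)");
    }
}
```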

GC troubleshooting:

  • Young GC time is too long

    Check the GC log (for G1) to see whether the time is spent in Root Scanning, Object Copy, or Ref Proc. If Ref Proc takes long, focus on reference-type objects (soft/weak/phantom/final references); if Root Scanning takes long, pay attention to the number of threads and to cross-generation references; if Object Copy takes long, look at object lifetimes (the VJmap tool can help here).

    Example: after two services were merged into one online, upstream callers started timing out more often and overall service availability dropped. Checking the monitoring showed that the young GC time had risen sharply.

    Comparing the young GC log with other projects showed that Root Scanning took a relatively long time, and project monitoring showed far too many threads (more than 4,000). The project was reworked to use Hystrix semaphore isolation plus asynchronous RPC, which reduced the thread count to about 800; the average young GC time dropped from 37 ms to 21 ms, as expected.

  • Young GC frequency is too high

    Check whether parameters such as -Xmn and -XX:SurvivorRatio are set appropriately and whether adjusting JVM parameters alone can achieve the goal. If the parameters are fine but the young GC frequency is still too high, use tools such as jmap and MAT to check whether the objects created by the business code are reasonable.

    Example: once, after the project integrated full-link logging for some users, the young GC frequency spiked considerably. Using jmap, a large number of JSON objects was discovered, and the offending code followed the pattern sketched after this list.

    As the code shows, the JSON object is built regardless of whether full-link logging is actually enabled, which is clearly not what was intended.

  • Full GC takes too long

    Check whether the old generation is too large; if so, shrink it. With CMS, check whether the time is spent in the Initial Mark or the Remark phase; if Remark is slow, you can add -XX:+CMSScavengeBeforeRemark. If the service is very sensitive to full GC pause times, proactive (scheduled) full GC can be adopted; for the implementation details, see: Talk about some problems of proactive full GC in the project.

  • Full GC frequency is too high

    This is one of the most common and most complex situations in Java. With CMS, for example, a collection is triggered when the permanent generation or the old generation reaches its occupancy threshold; if it is the old generation, the key question is why objects are being promoted to the old generation so quickly. If CMS is configured but real (stop-the-world) full GCs are occurring, you need to check which full GC trigger condition is being met to find the cause.

    Examples: frequent full GC caused by a Redis client connection-pool configuration, a weird full GC troubleshooting problem, a full GC troubleshooting case
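
The code from the young-GC-frequency example above is not reproduced in this article; the following is a hypothetical reconstruction of the pattern it describes, where the JSON payload is serialized before the full-link-logging switch is checked, so the allocation is paid on every request.

```java
// Hypothetical reconstruction (not the original project's code) of the pattern
// described above: the JSON object is built on every request, even when
// full-link logging is disabled for that request.
public class TraceLogger {

    // Buggy version: serialization happens before the switch is checked,
    // so a large JSON string is allocated for every single request.
    void logTraceBuggy(Request request) {
        String json = toJson(request);          // allocates on every call
        if (isFullLinkLogEnabled(request)) {
            write(json);
        }
    }

    // Fixed version: check the switch first; only enabled requests pay the cost.
    void logTrace(Request request) {
        if (isFullLinkLogEnabled(request)) {
            write(toJson(request));
        }
    }

    // --- hypothetical helpers, stubs for illustration only ---
    interface Request {}
    String toJson(Request request) { return "{}"; }
    boolean isFullLinkLogEnabled(Request request) { return false; }
    void write(String json) { /* append to the trace log */ }
}
```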

I/O problems:

iotop lets you see directly which threads are doing heavy I/O; you can then use jstack to locate the specific code from those thread IDs.
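
For illustration (a hypothetical sketch, not from the article), the kind of thread this workflow finds: a worker doing continuous disk writes, which iotop shows with high DISK WRITE and whose thread ID, converted to hex, matches the nid field of the corresponding stack in the jstack output.

```java
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch: a thread doing continuous disk writes. In iotop it shows
// up with high DISK WRITE; its TID, converted to hexadecimal, matches the
// "nid=0x..." field of the corresponding thread in the jstack output.
public class DiskHog {
    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            byte[] block = new byte[1024 * 1024];           // 1 MB per write
            try (FileOutputStream out = new FileOutputStream("/tmp/diskhog.bin")) {
                while (true) {                               // demo only: grows the file indefinitely
                    out.write(block);
                    out.getFD().sync();                      // force real disk I/O
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }, "disk-hog-writer");
        writer.start();
    }
}
```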

Methodology for troubleshooting problems:

Java service problems are often accompanied by several symptoms at once, such as a CPU spike together with a GC spike. In these situations we should follow the “symptom – problem – cause – solution” sequence to work through them.

  1. List all the abnormal symptoms, for example: a spike in service response time, a CPU spike, a spike in full GC frequency.

  2. List the actual problems. Not every symptom from step 1 is itself a problem: if symptom A causes symptom B, then A is the problem and B is only its consequence, and B should not be the main direction of the investigation.

  3. Find the causes. Some problems from step 2 are easier to investigate than others; a CPU spike, for example, is usually easier to pin down than a full GC spike, so work from the easy problems to the hard ones.

  4. Once the cause is identified, a specific solution can be presented and then validated.

Example: Troubleshooting of service response time spike