
Overview

Online faults mainly involve the CPU, disk, memory, and network. Most faults can involve problems at more than one of these levels, so it is best to check all four aspects in turn. Tools such as jstack and jmap are not limited to a single aspect either; the basic routine is df, free, and top first, then jstack and jmap against the service, and from there analyze the specific problem.

CPU problems

We usually start with CPU issues, since CPU exceptions are often easier to locate. Causes include business logic problems (infinite loops), frequent GC, and too many context switches. The most common ones are caused by business logic (or framework logic), and you can use jstack to analyze the corresponding stack.

Using jstack to analyze CPU problems

  1. Use the ps command to find the PID of the target process (if there are several candidate processes, use top first to pick one), then find the threads with high CPU usage (a consolidated sketch of the whole pipeline follows these steps)

top -H -p pid

Note that -p specifies the process ID and -H lists the individual threads of that process, so you can see which threads are using the most CPU.

  2. Convert the ID of the busiest thread to hexadecimal to get the nid

printf '%x\n' tid

  3. Find the corresponding stack information directly in the jstack output

jstack pid | grep '0x42' -C5 --color

Here we have found the stack of the thread with nid 0x42; now we just need to analyze it carefully.

  4. Scan the entire jstack file
  • Of course, it is more common to analyze the whole jstack dump. We usually focus on the WAITING and TIMED_WAITING states, not to mention BLOCKED.

  • Run cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c to get an overall view of the thread states; if, say, there are unusually many WAITING threads, there is most likely a problem.
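Putting the steps together, a minimal sketch of the pipeline might look like this (the pid and tid values are placeholders for your own process and thread IDs):

```bash
pid=12345                         # hypothetical Java process id found via ps/top
top -H -p "$pid"                  # list that process's threads sorted by CPU usage
tid=66                            # hypothetical busiest thread id taken from the top output
nid=$(printf '%x\n' "$tid")       # convert it to hex, e.g. 66 -> 42
jstack "$pid" | grep "nid=0x$nid" -C5 --color   # locate that thread's stack in the dump
```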

Frequent JVM GC (FullGC)

You can certainly use jstack to analyze such problems, but sometimes we first check whether GC is too frequent. Use the jstat -gc pid 1000 command to observe changes in the GC generations; 1000 is the sampling interval in milliseconds.

  • S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU represent the capacity and usage of the two Survivor spaces, the Eden space, the old generation, and the metaspace respectively.

  • YGC/YGCT and FGC/FGCT represent the count and total time of YoungGC and FullGC respectively, and GCT is the total GC time.

If GC looks frequent, continue the analysis on the GC side.
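As a small sketch of the jstat sampling above (the pid is a placeholder), sampling every second for 30 samples:

```bash
jstat -gc 12345 1000 30       # capacities and usage of each generation plus GC counts/times
jstat -gcutil 12345 1000 30   # the same counters as percentages, often easier to eyeball
```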

Context switch

For frequent context switching, you can use the vmstat command to check; vmstat 1 prints once per second.

  • The cs column represents the number of context switches.

If we want to monitor a particular pid, we can use the pidstat -w pid command; cswch/s and nvcswch/s show voluntary and involuntary context switches respectively.
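For example (the pid is a placeholder; pidstat comes with the sysstat package):

```bash
vmstat 1                  # watch the cs column for system-wide context switches per second
pidstat -w -p 12345 1     # cswch/s (voluntary) and nvcswch/s (involuntary) for one process
```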

Disk problems

State information

  • Disk problems are as fundamental as CPU ones. The first aspect is disk space: use df -hl directly to check the file system status.

  • Disk problems can also be performance problems, which can be analyzed with iostat -d -k -x.

(iostat output screenshot omitted; image from blog.csdn.net/pengjunlee)

  • The last column, %util, shows how busy each disk is, while rrqm/s and wrqm/s indicate the read and write rates respectively; this generally helps to locate the specific disk where the problem occurred.

  • In addition, we also need to know which process is doing the reading and writing. Generally developers have a rough idea themselves; otherwise, use the iotop command to locate the source of the file reads and writes.

However, what you get from iotop is a tid; to convert it to a pid, use readlink -f /proc/*/task/tid/../.. to find the pid.

(iotop output screenshot omitted; image from blog.csdn.net/pengjunlee)

Run the cat /proc/pid/io command to check the read/write status of the process

You can then run lsof to determine the specific files being read and written: lsof -p pid

(lsof output screenshot omitted; image from blog.csdn.net/pengjunlee)
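A rough sketch of the disk checks above, with a placeholder pid:

```bash
df -hl                      # file system capacity and usage
iostat -d -k -x 1           # per-device stats: watch %util and the read/write columns
iotop                       # which processes are generating the I/O (needs iotop installed)
cat /proc/12345/io          # cumulative read/write bytes of one process
lsof -p 12345               # which files that process currently has open
```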

Memory problems

Troubleshooting memory problems is more troublesome than CPU, and there are more scenarios, including OOM, GC issues, and off-heap memory. Normally we first use the free command to check the overall memory situation.

In-heap memory

Memory problems are mostly heap memory problems. It is mainly divided into OOM and StackOverflow.

OOM issues

The memory in the JVM is insufficient. OOM can be divided into the following categories:

Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
  • There is not enough memory to allocate Java stacks for new threads. This is basically thread pool code having a problem, such as forgetting to call shutdown or creating threads without bound, so look at the code level first, using jstack or jmap.

  • If the code looks fine, the JVM side can reduce the size of each thread stack by specifying -Xss.

  • You can also raise the OS limit on threads at the system level by modifying nofile and nproc in /etc/security/limits.conf.
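A sketch of the system-level checks, assuming a placeholder pid and user name:

```bash
ls /proc/12345/task | wc -l   # how many threads the JVM currently has
ulimit -u                     # max user processes (threads count against this)
ulimit -n                     # max open files
# Raise the limits persistently in /etc/security/limits.conf, e.g.:
#   appuser  soft  nproc   65535
#   appuser  hard  nproc   65535
```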

The problem
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Cause

This means that heap usage has reached the maximum set by -Xmx; it is probably the most common OOM error.

The solution

The solution is still to look in the code first: suspect a memory leak and locate the problem with jstack and jmap. If everything really is fine, expand the heap by adjusting the value of -Xmx.
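For instance, before simply raising the heap, a quick histogram can confirm what is filling it (the pid, sizes, and jar name are illustrative):

```bash
jmap -histo 12345 | head -n 20          # biggest classes by instance count and bytes
java -Xms4g -Xmx4g -jar my-service.jar  # enlarge the heap only if it is genuinely undersized
```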

The problem
Caused by: java.lang.OutOfMemoryError: Metaspace
The solution

Metaspace is insufficient; the limit can be adjusted with -XX:MaxMetaspaceSize (before JDK 8 the parameter was -XX:MaxPermSize).
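A minimal example of the relevant flags (values and jar name are illustrative):

```bash
java -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m -jar my-service.jar
# On JDK 7 and earlier the equivalent knob was -XX:MaxPermSize
```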

The problem
Exception in thread "main" java.lang.StackOverflowError
The solution

This indicates that the memory required by a thread's stack exceeds the -Xss value. Again, check the code first; the limit is adjusted with -Xss, but setting it too large may itself cause OOM.
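Before changing anything it can help to check the current default, then raise it cautiously (the jar name is illustrative):

```bash
java -XX:+PrintFlagsFinal -version | grep -i threadstacksize   # current default stack size
java -Xss2m -jar my-service.jar                                # larger stacks mean fewer threads fit in memory
```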

  • We use jmap to locate code-level memory leaks. For both OOM and StackOverflow, export a dump file with jmap -dump:format=b,file=filename pid

  • Use the Eclipse Memory Analyzer Tool (MAT) to import the dump file for analysis. Generally we go straight to Leak Suspects, where MAT gives its memory-leak suggestions. Alternatively, select Top Consumers to view the largest-object report.

  • You can select Thread Overview to analyze thread-related problems. You can also choose the Histogram class overview to do your own analysis, as described in MAT's tutorial.

Memory leaks in code are common, subtle and require more attention to detail.

  • For example, new objects are created on every request, resulting in large numbers of objects being created repeatedly.

  • File streams are opened but not closed properly; manual GC is triggered inappropriately; ByteBuffers are allocated improperly, and so on.

A dump file can be saved when OOM occurs by specifying -XX:+HeapDumpOnOutOfMemoryError in the startup parameters.
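For example (the dump path and jar name are illustrative):

```bash
java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps -jar my-service.jar
```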

GC issues and threads

  • GC problems affect not only the CPU but also memory. The troubleshooting approach is the same: use jstat to check generational changes, for example whether the youngGC or fullGC counts are too high, or whether EU, OU and other indicators are growing abnormally.

  • For the "unable to create new native thread" problem, in addition to jstack's detailed analysis of the dump file, we usually first look at the overall number of threads, via pstree -p pid | wc -l

Or you can view the number of threads in /proc/pid/task.

Off-heap memory

If you run into an off-heap memory overflow, that is unfortunate. The first symptom of an off-heap overflow is that the physical resident memory grows fast; whether an error is reported depends on how the memory is used.

If DirectByteBuffer is used, a failed allocation will be reported in the error log, typically as OutOfMemoryError: Direct buffer memory (or OutOfDirectMemoryError).

Off-heap memory leaks are often related to the use of NIO. Usually we first check the process memory with pmap -x pid | sort -rn -k3 | head -30, which lists the top 30 memory segments of the pid sorted by size in descending order. You can then run the command again after some time to see the memory growth, or compare with a normal machine to see which memory segments look suspicious.

If you have identified a suspicious memory segment, you can dump it with gdb --batch --pid {pid} -ex "dump memory filename.dump {memory start address} {memory start address + memory block size}".
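A sketch of the two steps together; the pid and the addresses are placeholders taken from your own pmap output:

```bash
pmap -x 12345 | sort -rn -k3 | head -30   # top 30 segments by resident size (column 3)
gdb --batch --pid 12345 \
    -ex "dump memory /tmp/suspect.dump 0x7f5200000000 0x7f5210000000"
```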

Troubleshoot problems

After getting the dump file, you can view it with hexdump -C filename | less, although most of it is binary garbage.

NMT (Native Memory Tracking) is a HotSpot feature introduced in Java 7u40. With the jcmd command we can see the detailed memory composition. You need to add -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail to the startup parameters, which causes a slight performance loss.

First record a baseline with jcmd pid VM.native_memory baseline.

Then wait a while to see how the memory grows, and do a summary- or detail-level diff with jcmd pid VM.native_memory detail.diff (or summary.diff).
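A sketch of the NMT workflow with a placeholder pid and jar name:

```bash
java -XX:NativeMemoryTracking=detail -jar my-service.jar   # enable NMT at startup
jcmd 12345 VM.native_memory baseline                        # record a baseline
# ... wait for the memory to grow ...
jcmd 12345 VM.native_memory detail.diff                     # or summary.diff
```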

As you can see, the memory breakdown given by jcmd is very detailed, covering the heap, threads, and GC (so the other memory exceptions mentioned above can also be analyzed with NMT). For off-heap issues we focus on the growth of Internal memory; if the growth is too obvious, something is wrong.

At the detail level you can also see where specific memory segments grew (screenshot omitted).

You can also use strace -f -e "brk,mmap,munmap" -p pid to monitor memory allocation.

The memory allocation information mainly includes PID and memory address

The key is to look at the error log stack, find the suspicious object, understand its recycling mechanism, and then analyze the corresponding object.

  • For example, memory allocated through DirectByteBuffer needs a full GC or a manual System.gc() to be reclaimed (so it is best not to use -XX:+DisableExplicitGC).

So you can trigger a fullGC manually with jmap -histo:live pid and see whether the off-heap memory has been collected.

If it is reclaimed, then there is a high probability that the off-heap memory itself was allocated too small; adjust it with -XX:MaxDirectMemorySize. If nothing changes, use jmap to analyze the objects that cannot be GC'd and their references to DirectByteBuffer.
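A sketch of this last check, with a placeholder pid and an illustrative limit:

```bash
top -p 12345                                          # note the RES column before and after
jmap -histo:live 12345 > /dev/null                    # histo:live forces a full GC
java -XX:MaxDirectMemorySize=1g -jar my-service.jar   # raise the cap only if the memory really was reclaimed
```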