An overview

Performance optimization has always been a focus of back-end services, but online performance failures do not happen often, and depending on the product there may be few chances to encounter them; I have run into very few performance problems myself. So, to build up the knowledge in advance and avoid scrambling when a problem does occur, this article simulates several common Java performance failures and walks through how to analyze and locate them.

Preliminary knowledge

Since we are going to locate problems, we need tools. Let's first look at the tools that help with troubleshooting.

The top command

The top command is one of the most commonly used Linux commands. It displays the CPU usage, memory usage, and other system information of running processes in real time. top -Hp PID displays resource usage for the individual threads of a process.

The vmstat command

vmstat is a virtual memory statistics tool that reports memory usage, CPU usage, and swap activity. It is also very useful for observing process context switches. The fields are described as follows:

  • r: number of processes in the run queue (if this stays above the number of CPU cores, tasks are queuing for CPU time)
  • b: number of processes blocked waiting for I/O
  • swpd: amount of swap space in use
  • free: amount of free physical memory
  • buff: amount of memory used as buffers (for block-device I/O)
  • cache: amount of memory used as page cache (cached file data)
  • si: amount of memory swapped in from disk per second
  • so: amount of memory swapped out to disk per second
  • bi: number of blocks read from block devices per second
  • bo: number of blocks written to block devices per second
  • in: number of interrupts per second, including clock interrupts
  • cs: number of context switches per second
  • us: percentage of CPU time spent in user mode
  • sy: percentage of CPU time spent in kernel (system) mode
  • wa: percentage of CPU time spent waiting for I/O
  • id: percentage of CPU time idle

The pidstat command

pidstat is part of the sysstat package and is a powerful performance monitoring tool. While top and vmstat monitor memory, CPU, and I/O usage at the process or system level, pidstat can drill down to the thread level. Its context-switch fields are described as follows:

  • UID: indicates the real user ID of the monitored task.

  • TGID: indicates the ID of a thread group.

  • TID: indicates the ID of a thread.

  • CSWCH/s: number of voluntary context switches per second; these happen when a task blocks because a resource it needs is unavailable, for example while waiting for a lock.

  • NVCSWCH/s: number of involuntary context switches per second; these happen when a task is forced off the CPU, for example when its time slice expires.

The jstack command

jstack is a JDK command-line tool for analyzing thread stacks. Its most common use is jstack PID to dump the stack traces of all threads in a Java process, and it is also frequently used to detect deadlocks.

The jstat command

jstat monitors a running Java program in real time, including heap memory and garbage collection statistics, and we most often use it to watch garbage collection. A common invocation is jstat -gc pid. The output fields are described as follows:

  • S0C: capacity of survivor space 0 in the young generation (KB);

  • S1C: capacity of survivor space 1 in the young generation (KB);

  • S0U: space currently used in survivor space 0 (KB);

  • S1U: space currently used in survivor space 1 (KB);

  • EC: capacity of Eden in the young generation (KB);

  • EU: space currently used in Eden (KB);

  • OC: capacity of the old generation (KB);

  • OU: space used in the old generation (KB);

  • MC: capacity of the metaspace (KB);

  • MU: space used in the metaspace (KB);

  • YGC: number of young-generation GCs from application startup to sampling time;

  • YGCT: total time spent in young-generation GC from application startup to sampling time (s);

  • FGC: number of Full GCs from application startup to sampling time;

  • FGCT: total time spent in Full GC from application startup to sampling time (s);

  • GCT: total time spent in GC from application startup to sampling time (s).

The jmap command

jmap is another JDK utility. It can show heap configuration and heap memory usage, and it can also generate heap dump files for detailed analysis. To check heap memory, run jmap -heap pid.

The MAT memory tool

The Memory Analyzer Tool (MAT) is an Eclipse plug-in (it can also be used standalone). When analyzing a large heap dump, it gives an intuitive view of how much memory each object occupies, the number of instances of each class, and object reference relationships; it supports OQL queries and makes it easy to find the GC Roots that keep an object alive. MAT can be downloaded from the Eclipse website.

Preparing the simulated environment

The base environment is JDK 1.8. We use the Spring Boot framework to write several endpoints that trigger the simulated scenarios. The first scenario simulates a fully loaded CPU.

CPU at 100%

Simulating a full CPU is easy: just write an infinite loop that keeps the CPU busy.

    /**
     * Simulates full CPU usage
     */
    @GetMapping("/cpu/loop")
    public void testCPULoop() throws InterruptedException {
        System.out.println("Request CPU loop");
        Thread.currentThread().setName("loop-thread-cpu");
        int num = 0;
        while (true) {
            num++;
            if (num == Integer.MAX_VALUE) {
                System.out.println("reset");
                num = 0;
            }
        }
    }

curl localhost:8080/cpu/loop

Run top -Hp 32805 to check the status of the threads inside the Java process.

Run printf '%x' 32826 to convert the busy thread's ID to hexadecimal so it can be looked up in the thread dump; the result is 803a. Finally, run jstack 32805 | grep -A 20 803a to check the corresponding section of the dump.

The dump points directly to the offending method and line of code, which locates the cause of the full CPU.

Memory leaks

A ThreadLocal is a thread-private variable that is bound to a thread for the thread's lifetime. ThreadLocal is implemented on top of ThreadLocalMap, whose Entry extends WeakReference and wraps the ThreadLocal instance as a weakly referenced key. A weakly referenced key can be reclaimed at the next GC, so if nothing is done after set(), the GC will eventually clear the key, but the value is still strongly referenced by the ThreadLocalMap of the still-alive thread and is never reclaimed. That is a memory leak.

    /**
     * Simulates a memory leak
     */
    @GetMapping(value = "/memory/leak")
    public String leak() {
        System.out.println("Simulated memory leak");
        ThreadLocal<Byte[]> localVariable = new ThreadLocal<Byte[]>();
        localVariable.set(new Byte[4096 * 1024]); // Bind a large array to the current thread
        return "ok";
    }
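
For reference, the leak in this demo could be avoided by always clearing the entry once it is no longer needed. Below is a minimal sketch; the /memory/leak-fixed endpoint is hypothetical and added only for illustration, it is not part of the original demo:

    /**
     * Hypothetical fixed variant: remove() clears the current thread's
     * ThreadLocalMap entry, so the Byte array does not outlive the request.
     */
    @GetMapping(value = "/memory/leak-fixed")
    public String leakFixed() {
        ThreadLocal<Byte[]> localVariable = new ThreadLocal<Byte[]>();
        localVariable.set(new Byte[4096 * 1024]);
        try {
            return "ok";
        } finally {
            localVariable.remove(); // Always clean up, even if the handler throws
        }
    }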

We cap the heap size at startup and configure the JVM to write a heap dump and GC log when memory runs out.

java -jar -Xms500m -Xmx500m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:/tmp/heaplog.log analysis-demo-0.0.1-SNAPSHOT.jar

Then call the endpoint in a loop: for i in {1..500}; do curl localhost:8080/memory/leak; done. Before the loop finishes, the application starts returning 500 errors, and the following exception appears in the system log:

java.lang.OutOfMemoryError: Java heap space

Let’s use the jstat -gc pid command to see the gc status of the program.

It is clear that the heap has run out of memory, and even after 45 Full GCs the heap has not freed any space. That means the objects in the heap are still alive and referenced from GC Roots, so they cannot be reclaimed. So what caused the overflow: do we just need to add more memory? If it were an ordinary out-of-memory situation, expanding the heap might be enough; but if it is a memory leak, the expanded heap will fill up again soon. So we first need to determine whether this is a leak. We saved a heap dump file earlier, so now we analyze it with MAT. After importing the dump, choose the Leak Suspects Report and the tool lists the suspected problems for you directly.

Four suspected memory leak problems have been listed here, so let’s click on one to see more details.

The report indicates that threads are each holding on to memory in the vicinity of 50 MB, and the memory is held through ThreadLocal. For a more detailed manual analysis, open the Histogram view to see which objects occupy the most memory, then analyze their reference chains to determine who is responsible for the overflow.

The objects that consume the most memory are Byte arrays. Let's check whether they are referenced from a GC Root, which would explain why they are not collected. Following the operation shown in the red box, the result is as follows:

We can see that the Byte arrays are referenced by thread objects; as the figure shows, the GC Root of each Byte array is a thread, so the arrays are not collected. Expanding the details shows that the memory is ultimately held through ThreadLocal entries, which matches the result of MAT's automatic analysis.

Deadlock

A deadlock ties up threads: blocked threads never return to the pool, so tasks pile up, memory usage grows, and eventually the JVM cannot take on new work. A thread pool makes this easy to demonstrate.

    ExecutorService service = new ThreadPoolExecutor(4, 10, 0, TimeUnit.SECONDS,
            new LinkedBlockingQueue<Runnable>(1024),
            Executors.defaultThreadFactory(),
            new ThreadPoolExecutor.AbortPolicy());

    /**
     * Simulates a deadlock
     */
    @GetMapping("/cpu/test")
    public String testCPU() throws InterruptedException {
        System.out.println("Request CPU");
        Object lock1 = new Object();
        Object lock2 = new Object();
        service.submit(new DeadLockThread(lock1, lock2), "deadLockThread-" + new Random().nextInt());
        service.submit(new DeadLockThread(lock2, lock1), "deadLockThread-" + new Random().nextInt());
        return "ok";
    }

public class DeadLockThread implements Runnable {
    private Object lock1;
    private Object lock2;

    public DeadLockThread(Object lock1, Object lock2) {
        this.lock1 = lock1;
        this.lock2 = lock2;
    }

    @Override
    public void run() {
        synchronized (lock2) {
            System.out.println(Thread.currentThread().getName() + " get lock2 and wait lock1");
            try {
                TimeUnit.MILLISECONDS.sleep(2000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            synchronized (lock1) {
                System.out.println(Thread.currentThread().getName() + " get lock1 and lock2");
            }
        }
    }
}

We called the endpoint 2,000 times in a loop, and soon errors appeared in the log: the thread pool and its queue were full, and because I chose AbortPolicy as the rejection policy, the pool threw an exception when the queue filled up:

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@2760298 rejected from java.util.concurrent.ThreadPoolExecutor@7ea7cd51[Running, pool size = 10, active threads = 10, queued tasks = 1024, completed tasks = 846]

Find the Java process ID with ps -ef | grep java, then run jstack pid to dump the Java thread stacks. Here jstack found five deadlocks; we list only one of them. Thread pool-1-thread-2 holds the lock 0x00000000f8387d88 and is waiting for the lock 0x00000000f8387d98, while thread pool-1-thread-1 holds 0x00000000f8387d98 and is waiting for 0x00000000f8387d88, which results in a deadlock.

Java stack information for the threads listed above:
===================================================
"pool-1-thread-2":
        at top.luozhou.analysisdemo.controller.DeadLockThread2.run(DeadLockThread.java:30)
        - waiting to lock <0x00000000f8387d98> (a java.lang.Object)
        - locked <0x00000000f8387d88> (a java.lang.Object)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
"pool-1-thread-1":
        at top.luozhou.analysisdemo.controller.DeadLockThread1.run(DeadLockThread.java:30)
        - waiting to lock <0x00000000f8387d88> (a java.lang.Object)
        - locked <0x00000000f8387d98> (a java.lang.Object)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
          
 Found 5 deadlocks.
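
Once the deadlock is confirmed, the usual remedy is to make every task acquire the locks in a single global order so that a circular wait cannot form. Below is a minimal sketch of such a variant of the run method above; it is an illustration, not part of the original code:

    @Override
    public void run() {
        // Impose one global acquisition order (here by identity hash) so no two tasks
        // can each hold one lock while waiting for the other. A real implementation
        // would also need a tie-breaker for equal hash codes.
        Object first = System.identityHashCode(lock1) <= System.identityHashCode(lock2) ? lock1 : lock2;
        Object second = (first == lock1) ? lock2 : lock1;
        synchronized (first) {
            synchronized (second) {
                System.out.println(Thread.currentThread().getName() + " got both locks");
            }
        }
    }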

Frequent thread switching

Context switching can waste a lot of CPU time on saving and restoring registers, kernel stacks, and virtual memory, which can degrade the overall system performance. When you see a significant performance degradation of your system, you need to consider whether there is a lot of thread context switching going on.

 @GetMapping(value = "/thread/swap")
    public String theadSwap(int num) {
        System.out.println("Simulate thread switching");
        for (int i = 0; i < num; i++) {
            new Thread(new ThreadSwap1(new AtomicInteger(0)),"thread-swap"+i).start();
        }
        return "ok";
    }
public class ThreadSwap1 implements Runnable {
    private AtomicInteger integer;

    public ThreadSwap1(AtomicInteger integer) {
        this.integer = integer;
    }

    @Override
    public void run() {
        while (true) {
            integer.addAndGet(1);
            Thread.yield(); // Yield the CPU so the scheduler can pick another thread
        }
    }
}

Here each thread performs a simple atomic increment and then yields the CPU, so in theory the scheduler will switch to another thread. We request the endpoint to create 100 threads: curl localhost:8080/thread/swap?num=100. After the request succeeds, we run vmstat 1 10, which prints one sample per second for ten samples. The context-switch statistics look like this:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in      cs us sy id wa st
101  0 128000 878384    908 468684    0    0     0     0 4071 8110498 14 86  0  0  0
100  0 128000 878384    908 468684    0    0     0     0 4065 8312463 15 85  0  0  0
100  0 128000 878384    908 468684    0    0     0     0 4107 8207718 14 87  0  0  0
100  0 128000 878384    908 468684    0    0     0     0 4083 8410174 14 86  0  0  0
100  0 128000 878384    908 468684    0    0     0     0 4083 8264377 14 86  0  0  0
100  0 128000 878384    908 468688    0    0     0   108 4182 8346826 14 86  0  0  0

Here we focus on four indicators: r, cs, us, and sy.

r = 100: roughly 100 runnable tasks are waiting for a CPU, far more than the number of cores, so threads are queuing for CPU time.

cs = 8 million+: more than eight million context switches per second, which is an enormous number.

us = 14: user mode gets only 14% of the CPU time to run application logic.

sy = 86: kernel (system) mode consumes 86% of the CPU.

We also used top and top -Hp PID to check CPU usage at the process and thread level. The Java process was using all of the CPU, but the load was spread very evenly across its threads; no single thread was eating up the CPU.

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                            
 87093 root      20   0 4194788 299056  13252 S 399.7 16.1  65:34.67 java 
 PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                             
 87189 root      20   0 4194788 299056  13252 R  4.7 16.1   0:41.11 java                                                                                
 87129 root      20   0 4194788 299056  13252 R  4.3 16.1   0:41.14 java                                                                                
 87130 root      20   0 4194788 299056  13252 R  4.3 16.1   0:40.51 java                                                                                
 87133 root      20   0 4194788 299056  13252 R  4.3 16.1   0:40.59 java                                                                                
 87134 root      20   0 4194788 299056  13252 R  4.3 16.1   0:40.95 java 

Since user mode uses only 14% of the CPU while kernel mode uses 86%, we can be fairly certain that the performance problem is caused by the Java program's thread context switching.

To confirm this, run pidstat -w -t -p 87093 1 10 to sample context-switch statistics for the process and its threads once per second, ten times:

11:04:30 PM   UID      TGID       TID   cswch/s nvcswch/s  Command
11:04:30 PM     0         -     87128      0.00     16.07  |__java
11:04:30 PM     0         -     87129      0.00     15.60  |__java
11:04:30 PM     0         -     87130      0.00     15.54  |__java
11:04:30 PM     0         -     87131      0.00     15.60  |__java
11:04:30 PM     0         -     87132      0.00     15.43  |__java
11:04:30 PM     0         -     87133      0.00     16.02  |__java
11:04:30 PM     0         -     87134      0.00     15.66  |__java
11:04:30 PM     0         -     87135      0.00     15.23  |__java
11:04:30 PM     0         -     87136      0.00     15.33  |__java
11:04:30 PM     0         -     87137      0.00     16.04  |__java

Based on the data collected above, each Java thread is switched about 15 times per second, whereas normally this number should be in the single digits or even below one. From this we can conclude that too many Java threads were created, and the resulting frequent context switching hurts overall performance.

Why does the system switch context more than 8 million times per second, when a single thread in the Java process only switches about 15 times?

System context switching can be divided into three situations:

1. Multitasking: In a multitasking environment, a context switch occurs when a process is switched from the CPU to run another process.

2. Interrupt handling: when a hardware interrupt occurs, the CPU switches context to handle it; in vmstat this corresponds to the in column.

3. Switching between user mode and kernel mode: when the operating system needs to cross from user mode into kernel mode, for example to make a system call, a switch is required.

Linux maintains a ready queue for each CPU, in which runnable processes are sorted by priority and by how long they have been waiting for a CPU; the scheduler then picks the highest-priority process that has waited the longest to run next. The length of this queue is the r column in vmstat.

When will a process be scheduled to run on the CPU?

  • When the process terminates, its previously used CPU is freed and a new process is taken from the ready queue to run
  • To ensure that all processes are scheduled fairly, CPU time is divided into time slices that are allocated to each process in turn. When a process runs out of time, it is suspended by the system and switched to another process waiting for the CPU.
  • When system resources are insufficient, a process can run only after the resources are sufficient. In this case, the process is suspended and the system schedules other processes to run.
  • When a process is actively suspended via the sleep function, it is also rescheduled.
  • When a process with a higher priority runs, the current process is suspended to ensure that the process with a higher priority runs.
  • When a hardware interrupt occurs, processes on the CPU are suspended by the interrupt and run interrupt service routines in the kernel instead.

Combining this with our earlier analysis: the ready queue holds around 100 tasks while the CPU has only 4 cores, so the context switching caused by scheduling alone is already very high; add roughly 4,000 interrupts per second plus system calls and the like, and it is not surprising that the system as a whole reaches around 8 million context switches per second. The individual Java threads are only recorded as switching about 15 times per second because, although a thread calls Thread.yield() to give up the CPU, the scheduler may well pick the same thread again; in that case there is no switch to another thread, which is why the per-thread switch counts stay low.
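
The remedy follows from the diagnosis: keep the number of runnable threads close to the number of CPU cores instead of starting one thread per task. Below is a minimal sketch; the /thread/swap-bounded endpoint is hypothetical and added only for illustration, it is not part of the original demo:

    // Pool sized to the available cores: extra tasks wait in the queue instead of
    // competing for the CPU, so the context-switch rate stays bounded.
    private final ExecutorService boundedPool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    @GetMapping(value = "/thread/swap-bounded")
    public String threadSwapBounded(int num) {
        for (int i = 0; i < num; i++) {
            boundedPool.submit(new ThreadSwap1(new AtomicInteger(0)));
        }
        return "ok";
    }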

Conclusion

This article simulated common performance problem scenarios and showed how to analyze and locate 100% CPU usage, memory leaks, deadlocks, and frequent thread switching. To analyze such problems well we need two things: a solid grasp of the underlying principles, and good tools. This article also listed the common tools and commands used for analysis. Of course, a real production environment can be far more complicated than these simulations, but the principles are the same and the symptoms are similar; if we focus on the principles and apply them flexibly, even complex online problems can be resolved smoothly.

References

1. linux.die.net/man/1/pidst…

2. linux.die.net/man/8/vmsta…

3. help.eclipse.org/2020-03/ind…

Written at the end

Recently several readers told me that my articles had been republished by other official accounts marked as original, without my authorization, and I object to that kind of behavior. I very much hope good articles can spread widely and help more people, but plagiarism is something I despise. To protect the copyright of my articles, I have registered my own official account; since it is new there is not much content yet, but I will keep updating it. Thanks for your support.