
Java application performance optimization is a perennial topic. Typical performance problems include slow page response, interface timeouts, high server load, low concurrency, and frequent database deadlocks. Especially today, when the "rough and fast" style of Internet development is popular, performance problems of all kinds appear as traffic grows and code becomes bloated.

Java application performance bottlenecks are numerous: system factors such as disk, memory, and network I/O, as well as application code, JVM GC, the database, and caches. Based on personal experience, the author divides Java performance optimization into four layers: the application layer, the database layer, the framework layer, and the JVM layer, as shown in Figure 1.

Figure 1. Java performance optimization layered model

The difficulty of optimization increases from layer to layer, and the knowledge involved and the problems solved differ. For example, the application layer requires understanding the code logic and locating problematic lines of code through the Java thread stack; the database layer requires analyzing SQL and locating deadlocks; the framework layer requires reading the source code and understanding the framework's mechanisms; and the JVM layer requires an in-depth understanding of GC types and how they work, as well as the effects of the various JVM parameters.

There are two basic approaches to Java performance analysis: on-site analysis and post-mortem analysis.

On-site analysis preserves the live site and uses diagnostic tools to analyze and locate the problem on the spot. It has a large impact on online services and is inappropriate in some scenarios (especially when critical online services are involved).

Post-mortem analysis collects as much on-site data as possible, restores service immediately, and then analyzes and reproduces the problem from the collected data. Let's share some cases and practices, starting with performance diagnostic tools.

1 Performance diagnostic tool

One type of performance diagnosis targets systems and code that have already been confirmed to have performance problems; the other tests a system's performance before it goes live to determine whether it meets the launch requirements.

This article focuses on the former; the latter can be tested with various performance pressure tools, such as JMeter, and is beyond the scope of this article.

**For Java applications, performance diagnostics are divided into two main layers:** OS level and Java application level (including application code diagnostics and GC diagnostics).

OS diagnosis

OS diagnosis focuses on CPU, Memory, and I/O.

2 CPU diagnosis

For the CPU, focus on load average, CPU usage, and the number of context switches.

You can run the top command to view the system load average and CPU usage. Figure 2 shows the status of a certain system.

Figure 2. Example of the top command

The load average consists of three numbers: 63.66, 58.39, and 57.18, which indicate the machine's load over the past 1, 5, and 15 minutes respectively. As a rule of thumb, if the value is less than 0.7 × the number of CPU cores, the system is working normally; beyond that, or even at four or five times the number of cores, the system is significantly overloaded.

In Figure 2, the 15-minute load is 57.18 and the 1-minute load is 63.66 (the system has 16 cores, so the guideline threshold is roughly 0.7 × 16 ≈ 11), indicating that the system has a load problem and that the load is still trending upward; the specific cause needs to be located.

You can view the number of CPU context switches using the vmstat command, as shown in Figure 3:

Figure 3. Example of the vmstat command

Context switches occur in the following scenarios:

  • 1) When the time slice runs out, the CPU normally schedules the next task;

  • 2) It is preempted by other tasks with higher priority;

  • 3) When I/O blocks during task execution, the current task is suspended and the CPU switches to the next task;

  • 4) User code actively suspends the current task to give up CPU;

  • 5) Contention for shared resources among multiple tasks: a task is suspended because it fails to acquire the resource;

  • 6) Hardware interrupts.

Java thread context switching comes primarily from competing for shared resources. Locking a single object is rarely a system bottleneck unless the lock granularity is too large. However, in a code block with high access frequency and continuous locking of multiple objects, a large number of context switches may occur, which becomes the bottleneck of the system.

For example, in our system, log4j 1.x printed a large number of logs under high concurrency, causing frequent context switches and blocking a large number of threads, which resulted in a large drop in throughput, as shown in Listing 1. Upgrading to log4j 2.x solved this problem.

Listing 1. log4j 1.x synchronized logging code

```java
for (Category c = this; c != null; c = c.parent) {
    // Protected against simultaneous call to addAppender, removeAppender,...
    synchronized (c) {
        if (c.aai != null) {
            write += c.aai.appendLoopOnAppenders(event);
        }
        // ...
    }
}
```
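Log4j 2.x avoids this kind of per-logger synchronized traversal in the logging hot path. If fully asynchronous logging is desired, a common way to enable it (an illustration only, assuming log4j 2.x and the LMAX Disruptor library are on the classpath) is to select the async logger context before any logger is created:

```java
// Illustration only: enable log4j 2.x asynchronous loggers for the whole
// application by selecting the async logger context. This must run before the
// first logger is obtained and requires the LMAX Disruptor on the classpath.
public class AsyncLoggingBootstrap {
    public static void main(String[] args) {
        System.setProperty("Log4jContextSelector",
                "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");
        // ... start the application; loggers created afterwards are asynchronous
    }
}
```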

3 Memory

From the operating system's perspective, the concern is whether memory is sufficient for the application process. You can run the free -m command to check memory usage.

You can run the top command to view the virtual memory (VIRT) and physical memory (RES) used by a process. Based on the formula VIRT = SWAP + RES, you can calculate how much swap a specific application uses. Excessive swap usage may affect the performance of Java applications; you can set swappiness to as small a value as possible.

For Java applications, using too much swap space can affect performance, since disks are much slower than memory.

4 I/O

I/O includes disk I/O and network I/O, and disks are more prone to I/O bottlenecks. You can run the iostat command to check the disk read/write status, and use the CPU's I/O wait value to determine whether there is a disk I/O bottleneck.

If the disk I/O status is always high, the disk is slow or faulty, causing performance bottlenecks and requiring application optimization or disk replacement.

In addition to common commands such as top, ps, vmstat, and iostat, other Linux tools such as mpstat, tcpdump, netstat, pidstat, and sar can be used to diagnose system problems. Brendan Gregg summarizes the Linux performance diagnostic tools for different device types, as shown in Figure 4.

Figure 4. Linux performance measurement tools

5 Java Application Diagnostics and tools

Application code performance problems are relatively easy to solve. With application-level monitoring and alerting, if the problematic function or code is already known, it can be located directly in the code. Alternatively, using top + jstack, you can find the problematic thread stack and locate the offending code. For more complex code segments with intricate logic, printing performance logs with a Stopwatch can also locate most application code performance issues.
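For example, here is a minimal sketch of the stopwatch approach using plain System.nanoTime() (a Guava Stopwatch or Apache Commons StopWatch can be used the same way; the doBusinessLogic method below is a hypothetical stand-in for the code under test):

```java
// Minimal "stopwatch" performance log: measure and print the elapsed time of
// one suspect code path. doBusinessLogic() is illustrative only.
public class StopwatchDemo {
    public static void main(String[] args) {
        long start = System.nanoTime();
        doBusinessLogic();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("doBusinessLogic took " + elapsedMs + " ms");
    }

    private static void doBusinessLogic() {
        try {
            Thread.sleep(50);  // placeholder work
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```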

Common Java application diagnostics include threading, stack, GC, and so on.

jstack

The jstack command is usually used together with top: locate the Java process and its threads with top -H -p pid, then export the thread stack with jstack -l pid. Since the thread stack is transient, it needs to be dumped several times, usually 3 dumps about 5 seconds apart. Convert the thread ID found by top into hexadecimal to obtain the nid in the Java thread stack, and you can then find the corresponding problem thread stack.
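The conversion step can be checked with a one-liner in Java (the thread ID 24985 below is the value shown in Figure 5):

```java
// Converts the decimal thread ID reported by `top -H -p <pid>` into the
// hexadecimal nid that appears in the jstack output: 24985 -> 0x6199.
public class ThreadIdToNid {
    public static void main(String[] args) {
        System.out.println("nid=0x" + Integer.toHexString(24985)); // prints nid=0x6199
    }
}
```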

Figure 5. Viewing a long-running Java thread with top -H -p

As shown in Figure 5, thread 24985 has been running for a long time and may have a problem. After converting its ID to hexadecimal (0x6199), the corresponding stack can be found in the Java thread stack and the problem located, as shown in Figure 6.

Figure 6. Viewing the thread stack with jstack

JProfiler

JProfiler is powerful for analyzing CPU, heap, and memory, as shown in Figure 7. Combined with a load-testing tool, it can also sample code execution times.

Figure 7. Memory analysis by JProfiler

6 GC diagnosis

The Java GC relieves programmers of the risks of managing memory themselves, but application pauses caused by GC are another issue that needs to be addressed. The JDK provides a number of tools to locate GC problems, including jstat and jmap, along with third-party tools such as MAT.

jstat

The jstat command prints GC details, Young GC and Full GC counts, heap information, and more. The command format is jstat -gcxxx -t pid (for example, jstat -gcutil -t <pid> 1000), as shown in Figure 8.

Figure 8. Example of the jstat command

jmap

jmap displays Java process heap information with jmap -heap pid. You can use jmap -dump:format=b,file=xxx pid to dump the heap to a file, and then use other tools to analyze the heap usage further.
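Besides running jmap externally, a heap dump can also be triggered from inside the JVM through the HotSpot-specific HotSpotDiagnostic MXBean. A minimal sketch (the output file name is arbitrary):

```java
// Sketch: trigger a heap dump programmatically via the HotSpotDiagnostic MXBean
// (HotSpot-specific), as an alternative to running jmap from the command line.
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // true = dump only live objects (forces a GC first), similar to jmap -dump:live
        bean.dumpHeap("heap.hprof", true);
    }
}
```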

MAT

MAT is a powerful tool for analyzing the Java heap and provides intuitive diagnostic reports. Its built-in OQL allows SQL-like queries against the heap, which is powerful, and its outgoing/incoming reference views can trace object references back to their source.

Figure 9. MAT sample

Figure 9 shows an example of MAT. MAT displays two measures of object size: shallow size and retained size. The former is the memory occupied by the object itself, excluding the objects it references; the latter is the sum of the object's own shallow size and the shallow sizes of the objects it directly or indirectly references, i.e., the amount of memory the GC would free once the object is reclaimed, and is generally the one to focus on.
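As a simple, hypothetical illustration of the difference:

```java
// Hypothetical example: the Wrapper instance's shallow size is just its object
// header plus one reference field, while its retained size also includes the
// ~1 MB byte[] it exclusively references -- the memory the GC would free if
// the Wrapper itself became unreachable.
public class Wrapper {
    private final byte[] payload = new byte[1024 * 1024];
}
```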

For some large heap dumps (tens of gigabytes), MAT itself needs a large amount of memory to open them.

Usually, a local development machine does not have enough memory to open such a dump. It is recommended to install a graphical environment and MAT on an offline server and open the dump remotely, or to run the mat command to generate a heap index and copy the index locally, although the heap information visible this way is limited.

To diagnose GC problems, it is recommended to add -XX:+PrintGCDateStamps to the JVM parameters. Common GC parameters are shown in Figure 10.

Figure 10. Common GC parameters

**For Java applications, the combination of top + jstack + jmap + MAT can locate most application and memory problems and is an essential toolset.** In some cases, Java application diagnostics need to refer to OS-related information, and more comprehensive diagnostic tools such as Zabbix (which integrates OS and JVM monitoring) can be used. In distributed environments, infrastructure such as distributed tracing systems also provides strong support for application performance diagnostics.

7 Performance Optimization practices

Having introduced some commonly used performance diagnostic tools, we will now share examples from the JVM layer, the application code layer, and the database layer, combined with some of our practices in Java application tuning.

JVM tuning: GC pain

During the system refactoring of the XX commercial platform, RMI was chosen as the internal remote call protocol. After the system went live, it began to stop responding periodically, with pauses ranging from a few seconds to tens of seconds. The GC logs showed that a Full GC occurred every hour from the moment the service started. Because the system heap was large, each Full GC paused the application for a long time, which had a great impact on online real-time services.

After analysis, it was found that the system had no periodic Full GC before the refactoring, so the problem was suspected to lie at the RMI framework level. Investigation revealed that RMI's Distributed Garbage Collection (DGC) starts a daemon thread, shown in Listing 2, that periodically executes a Full GC to reclaim remote objects.

Listing 2. DGC daemon thread source code

```java
private static class Daemon extends Thread {
    public void run() {
        for (;;) {
            // ...
            long d = maxObjectInspectionAge();
            if (d >= l) {
                System.gc();
                d = 0;
            }
            // ...
        }
    }
}
```

Once the problem was located, it was easier to solve. One option is to disable explicit calls to System.gc() by adding the -XX:+DisableExplicitGC parameter, but for systems using NIO this carries a risk of off-heap memory overflow.

Another option is to increase the -Dsun.rmi.dgc.server.gcInterval and -Dsun.rmi.dgc.client.gcInterval parameters to lengthen the Full GC interval, and at the same time add -XX:+ExplicitGCInvokesConcurrent, which turns a stop-the-world Full GC into a concurrent GC cycle. This reduces the pause time and does not affect NIO applications.

As can be seen from Figure 11, the number of Full GCs decreased significantly after the adjustment in March.

Figure 11. Full GC monitoring statistics

GC tuning is still necessary for applications with high concurrency and large volumes of data interaction, especially since the default JVM parameters often do not meet business requirements and need to be tuned specifically. Interpreting GC logs is well covered by public material and is beyond the scope of this article.

There are three basic ideas for GC tuning:

  • (1) Reduce GC frequency: increase heap space and reduce unnecessary object creation;

  • (2) Reduce GC pause time: reduce heap space and use the CMS GC algorithm;

  • (3) Avoid Full GC: adjust the CMS trigger ratio, avoid Promotion Failure and Concurrent mode failure (give the old generation more space, increase the number of GC threads to speed up collection), reduce the creation of large objects, and so on.

Application layer tuning: sniffing out bad code

Tuning at the application code layer, by analyzing the root cause of degraded code efficiency, is undoubtedly a good way to improve the performance of a Java application.

After a daily release of a commercial advertising system (which uses Nginx for load balancing), the load on several of its machines rose sharply and CPU utilization was quickly maxed out. We rolled back the release as an emergency measure and preserved the live site of one server via jmap and jstack.

Figure 12. Stack field analysis through MAT

The stack site is shown in Figure 12. MAT's analysis of the dump shows that the objects consuming the most memory are byte[] and java.util.HashMap$Entry, and that the java.util.HashMap$Entry objects contain circular references. It was initially determined that an infinite loop had probably occurred during a HashMap put operation (the next references of the java.util.HashMap$Entry objects at 0x2add6d992cb8 and 0x2add6d992ce8 form a loop).

Referring to the relevant documentation (bugs.java.com/bugdatabase…), this was confirmed to be a typical error in concurrent usage scenarios. In brief, HashMap itself is not thread-safe; when multiple threads put simultaneously, the internal array resize can cause the HashMap's internal linked list to form a ring, resulting in an infinite loop.
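A minimal sketch of the unsafe pattern (illustrative only; the ring formation described above is specific to pre-JDK 8 HashMap implementations and is not deterministic):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: several threads put into a shared, unsynchronized HashMap.
// On the JDK 7-era HashMap described above, a concurrent resize can corrupt the
// internal linked lists, after which get()/put() may spin a CPU core at 100%.
public class HashMapRaceSketch {
    private static final Map<Integer, Integer> MAP = new HashMap<Integer, Integer>();

    public static void main(String[] args) {
        for (int t = 0; t < 8; t++) {
            new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < 1_000_000; i++) {
                        MAP.put((int) (Math.random() * 100_000), i); // unsafe concurrent put
                    }
                }
            }).start();
        }
    }
}
```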

The biggest change in this release was to cache website data in memory to improve system performance, combined with lazy loading, as shown in Listing 3.

Listing 3. Lazy loading code for web site data

```java
private static Map<Long, UnionDomain> domainMap = new HashMap<Long, UnionDomain>();

private boolean isResetDomains() {
    if (CollectionUtils.isEmpty(domainMap)) {
        // Get website details from the remote HTTP interface
        List<UnionDomain> newDomains = unionDomainHttpClient.queryAllUnionDomain();
        if (CollectionUtils.isEmpty(domainMap)) {
            domainMap = new HashMap<Long, UnionDomain>();
            for (UnionDomain domain : newDomains) {
                if (domain != null) {
                    domainMap.put(domain.getSubdomainId(), domain);
                }
            }
        }
        return true;
    }
    return false;
}
```

As can be seen, domainMap is a statically shared resource of type HashMap; under multiple threads, its internal linked list can form a ring structure and cause an infinite loop.

The connection and access logs of the front-end Nginx show that a large number of user requests had accumulated while the system restarted and the Resin container started up. Once started, this flood of requests poured into the application, and multiple threads simultaneously requested and initialized the website data, triggering the HashMap concurrency problem. After locating the cause of the fault, the solution is relatively simple, mainly:

  • (1) Use ConcurrentHashMap or synchronized blocks to fix the concurrency problem above (a minimal sketch follows this list);

  • (2) Complete the website cache loading before the system starts, remove lazy loading, etc.;

  • (3) Replace local cache with distributed cache.
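As an illustration of solution (1), here is a minimal sketch that replaces the shared HashMap in Listing 3 with a ConcurrentHashMap and guards the one-time load with a synchronized double check. The UnionDomain type and unionDomainHttpClient field are the names used in Listing 3; everything else is illustrative.

```java
// Sketch of solution (1): a thread-safe map plus a synchronized double check so
// that only one thread loads the website cache. UnionDomain and
// unionDomainHttpClient are the names from Listing 3.
// Requires: java.util.List, java.util.Map, java.util.concurrent.ConcurrentHashMap
private static final Map<Long, UnionDomain> domainMap = new ConcurrentHashMap<Long, UnionDomain>();

private boolean resetDomainsIfEmpty() {
    if (domainMap.isEmpty()) {
        synchronized (domainMap) {
            if (domainMap.isEmpty()) {
                // Get website details from the remote HTTP interface
                List<UnionDomain> newDomains = unionDomainHttpClient.queryAllUnionDomain();
                for (UnionDomain domain : newDomains) {
                    if (domain != null) {
                        domainMap.put(domain.getSubdomainId(), domain);
                    }
                }
            }
        }
        return true;
    }
    return false;
}
```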

For locating bad code, in addition to code review in the conventional sense, tools such as MAT can to a certain extent quickly locate system performance bottlenecks. However, when the problem is bound to a specific scenario or to business data, auxiliary code walkthroughs, performance profiling tools, data simulation, and even online traffic diversion are needed to finally identify the source of the performance problem. Here are some possible characteristics of bad code for reference:

  • (1) Poor code readability, with no basic coding conventions;

  • (2) Excessive object generation or large object generation, memory leakage, etc.;

  • (3) Too many I/O stream operations, or streams that are never closed;

  • (4) Too many database operations and too long transactions;

  • (5) Incorrect use of synchronization;

  • (6) Time-consuming operations performed inside loops, etc.

Database layer tuning: deadlock nightmare

For most Java applications, interacting with a database is common, especially for OLTP applications with high requirements for data consistency, where database performance directly affects the performance of the entire application. As an advertising and delivery platform for advertisers, the Sogou commercial platform has high requirements for the real-time behavior and consistency of its materials, and we have accumulated some experience in relational database optimization.

For the advertising material library, high operation frequency (especially through batch material tools) can easily cause database deadlocks. A typical scenario is adjusting the price of advertising material: customers often adjust material bids frequently, which indirectly puts a heavy load on the database system and increases the likelihood of deadlocks. The following case of advertising material price adjustment in the Sogou commercial platform advertising system illustrates this.

One day, a commercial advertising system experienced a sudden increase in traffic, resulting in increased system load and frequent database deadlocks. The deadlock statements are shown in Figure 13.

Figure 13. Deadlock statements

The groupdomain table has three single-column indexes: idx_groupdomain_accountid (accountid), idx_groupdomain_groupid (groupid), and primary (groupdomainid), and it uses the MySQL InnoDB engine.

The scenario occurs when group bids are updated; it involves groups, group industries (the groupindus table), and group sites (the groupdomain table).

When a group bid is updated, group industry bids that use the group bid (marked by isusegroupprice; if it is 1, the group bid is used) must be updated as well. Likewise, if a group site bid uses the group industry bid (marked by isuseindusprice; if it is 1, the group industry bid is used), it also needs to be updated at the same time. Since each group can have up to 3,000 sites, records are locked for a long time when a group bid is updated.

As can be seen from the deadlock above, both transaction 1 and transaction 2 chose the single-column index idx_groupdomain_accountid. The MySQL InnoDB engine locks only one index within a transaction and, when a secondary index is used, also attempts to lock the corresponding primary key index. Transaction 1 held the lock on the secondary index idx_groupdomain_accountid (space id 5726 page no 8658 n bits 824 index), while transaction 2 had also acquired a lock on that secondary index (space id 5726 page no 8658 n bits 824 index) and was waiting to lock the primary key index. Because transaction 2 waited too long or did not release the lock, transaction 1 was eventually rolled back.

Tracking the daily access logs shows that a large number of clients ran scripts to modify promotion group bids, resulting in a large number of transactions waiting in a loop for earlier transactions to release the locked primary key. The root cause of this problem is actually the way the MySQL InnoDB engine uses indexes for locking; the problem is not prominent in Oracle databases.

The solution, of course, is to lock as few records as possible in a single transaction, which greatly reduces the probability of deadlock. Finally, a composite index (accountid, groupid) was used to reduce the number of records locked by a single transaction and to isolate the promotion group data records under different plans, thereby reducing the probability of this kind of deadlock.

Generally speaking, for database layer tuning we will basically start from the following aspects:

(1) Optimization at the SQL statement level: slow SQL analysis, index analysis and tuning, transaction splitting, etc.;

(2) Optimization at the database configuration level, such as field design, cache size adjustment, disk I/O and other database parameter tuning, data sharding, etc.;

(3) Optimization at the database structure level: consider vertical and horizontal splitting of the database;

(4) Choose the right database engine or type to adapt to different scenarios, such as considering the introduction of NoSQL, etc.

8 Summary and suggestions

Performance tuning also follows the 80/20 principle: 80% of performance problems are caused by 20% of the code, so optimizing the key code yields outsized returns. At the same time, optimization should be done on demand; over-optimization may introduce more problems. For Java performance optimization, you need to understand not only the system architecture and application code, but also the JVM layer and even the underlying operating system. To sum up, the following points can be considered:

1) Basic performance tuning

Basic performance here refers to upgrades and optimizations at the hardware or operating system level, such as network tuning, operating system version upgrades, and hardware device optimization. For example, using F5 load balancers, introducing SSD disks, and the NIO improvements in newer Linux versions can all greatly improve application performance;

2) Database performance optimization

This includes common practices such as transaction splitting, index tuning, SQL optimization, and the introduction of NoSQL, for example introducing asynchronous processing during transaction splitting to reach eventual consistency, or introducing various NoSQL databases for specific scenarios, which can greatly alleviate the shortcomings of traditional databases under high concurrency;

3) Application architecture optimization

New computing or storage frameworks can be introduced, using their new features to solve the computing performance bottlenecks of the original cluster; or distributed strategies can be introduced to scale computing and storage horizontally, including pre-computing results in advance as a typical trade of space for time. These measures can reduce the system load to a certain extent;

4) Optimization at the business level

Technology is not the only means of improving system performance. In many scenarios where performance problems occur, the root cause is a particular business scenario, and avoiding or adjusting the problem at the business level is often the most effective solution.