Source code address

github.com/whx123/Java…

I. Problem description

Last Friday night, some instances of the main business went offline. After checking the logs, we found the cause was a long GC pause in the cache system. The GC logs here have two characteristics:

  • 1. The GC pause time exceeds 2s, and some nodes even show ultra-long pauses of 12s.
  • 2. For a given node, the interval since its previous GC is generally 13 to 15 days.

The heap dump shows one object type piled up to 10,140 instances.

The preliminary judgment is that the long GC pauses are caused by the massive accumulation of com.mysql.jdbc.NonRegisteringDriver$ConnectionPhantomReference objects.

II. Problem analysis

The database dependencies in the current production environment are as follows:

Dependency      Version
mysql           5.1.47
hikari          2.7.9
sharding-jdbc   3.1.0

Based on the above description, the following questions are raised:

  • 1. What exactly is the com.mysql.jdbc.NonRegisteringDriver$ConnectionPhantomReference object?
  • 2. Why do these objects accumulate so much that the JVM can’t recycle them?

What is the NonRegisteringDriver$ConnectionPhantomReference object?

In simple terms, the NonRegisteringDriver class keeps a collection of phantom references, connectionPhantomRefs, that tracks all database connections, and the NonRegisteringDriver.trackConnection method is responsible for putting each newly created connection into that collection. The source code is as follows:

```java
public class NonRegisteringDriver implements java.sql.Driver {
    protected static final ConcurrentHashMap<ConnectionPhantomReference, ConnectionPhantomReference> connectionPhantomRefs =
            new ConcurrentHashMap<ConnectionPhantomReference, ConnectionPhantomReference>();
    protected static final ReferenceQueue<ConnectionImpl> refQueue = new ReferenceQueue<ConnectionImpl>();

    ...

    protected static void trackConnection(Connection newConn) {
        ConnectionPhantomReference phantomRef =
                new ConnectionPhantomReference((ConnectionImpl) newConn, refQueue);
        connectionPhantomRefs.put(phantomRef, phantomRef);
    }
    ...
}
```

Following the source code that creates a database connection, we find that it ends up in the com.mysql.jdbc.ConnectionImpl constructor. This constructor calls the createNewIO method to create a new MysqlIO database connection, and then calls the NonRegisteringDriver.trackConnection method mentioned above to put the connection into the NonRegisteringDriver.connectionPhantomRefs collection. The source code is as follows:

```java
public class ConnectionImpl extends ConnectionPropertiesImpl implements MySQLConnection {

    public ConnectionImpl(String hostToConnectTo, int portToConnectTo, Properties info,
                          String databaseToConnectTo, String url) throws SQLException {
        ...
        createNewIO(false);
        ...
        NonRegisteringDriver.trackConnection(this);
        ...
    }
}
```

connectionPhantomRefs is a collection of phantom references. What is a phantom reference, and why use a reference queue?

  • Phantom references, also known as ghost references, are the weakest type of reference relationship.
  • If an object is reachable only through phantom references, it can be collected by the garbage collector at any time, just as if it had no references at all.
  • The sole purpose of associating a phantom reference with an object is to receive a notification when the object is reclaimed by the collector.
  • When the garbage collector reclaims an object that still has a phantom reference, it enqueues the reference after the collection, and the object is not completely destroyed until its associated phantom reference is dequeued. You can therefore tell whether an object has been reclaimed by checking whether the corresponding phantom reference appears in the reference queue.
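The behavior described above can be shown with a minimal sketch. This is not the mysql driver's code; the class name and the 2-second wait are illustrative assumptions, and System.gc() is only a hint that most desktop/server JVMs honor:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

public class PhantomRefDemo {

    // Returns true if the phantom reference is enqueued after its referent
    // becomes unreachable and a GC is requested.
    public static boolean referenceEnqueuedAfterGc() throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Object connection = new Object(); // stand-in for a database connection
        PhantomReference<Object> ref = new PhantomReference<>(connection, queue);

        // get() on a phantom reference always returns null; the reference
        // exists only so the collector can notify us via the queue.
        if (ref.get() != null) {
            throw new IllegalStateException("phantom get() must return null");
        }

        connection = null; // drop the last strong reference
        System.gc();       // only a hint, but usually honored promptly

        // Wait up to 2s for the collector to enqueue the reference.
        Reference<?> polled = queue.remove(2000);
        return polled == ref;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("enqueued after GC: " + referenceEnqueuedAfterGc());
    }
}
```

Until the reference is dequeued and cleared, the collector cannot fully release the referent's memory, which is exactly why a huge connectionPhantomRefs backlog slows down GC.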

Why do so many objects pile up in connectionPhantomRefs?

Here is the HikariCP data source configuration in the project, explained alongside the official documentation.

HikariCP connection pool parameters

maximumPoolSize

This property controls the maximum size that the pool is allowed to reach, including both idle and in-use connections. Basically this value will determine the maximum number of actual connections to the database backend. A reasonable value for this is best determined by your execution environment. When the pool reaches this size, and no idle connections are available, calls to getConnection() will block for up to connectionTimeout milliseconds before timing out. Please read about pool sizing. Default: 10

maximumPoolSize controls the maximum number of connections; the default is 10.

minimumIdle

This property controls the minimum number of idle connections that HikariCP tries to maintain in the pool. If the idle connections dip below this value and total connections in the pool are less than maximumPoolSize, HikariCP will make a best effort to add additional connections quickly and efficiently. However, for maximum performance and responsiveness to spike demands, we recommend not setting this value and instead allowing HikariCP to act as a fixed size connection pool. Default: same as maximumPoolSize

minimumIdle controls the minimum number of idle connections; the default is the same as maximumPoolSize, i.e. 10.

⌚ idleTimeout

This property controls the maximum amount of time that a connection is allowed to sit idle in the pool. This setting only applies when minimumIdle is defined to be less than maximumPoolSize. Idle connections will not be retired once the pool reaches minimumIdle connections. Whether a connection is retired as idle or not is subject to a maximum variation of +30 seconds, and average variation of +15 seconds. A connection will never be retired as idle before this timeout. A value of 0 means that idle connections are never removed from the pool. The minimum allowed value is 10000ms (10 seconds). Default: 600000 (10 minutes)

If a connection stays idle longer than idleTimeout (10 minutes by default) while the pool holds more than minimumIdle connections, the connection is discarded.

⌚ maxLifetime

This property controls the maximum lifetime of a connection in the pool. An in-use connection will never be retired, only when it is closed will it then be removed. On a connection-by-connection basis, minor negative attenuation is applied to avoid mass-extinction in the pool. We strongly recommend setting this value, and it should be several seconds shorter than any database or infrastructure imposed connection time limit. A value of 0 indicates no maximum lifetime (infinite lifetime), subject of course to the idleTimeout setting. Default: 1800000 (30 minutes)

If a connection's lifetime exceeds maxLifetime (30 minutes by default), the connection is discarded once it is no longer in use.

Let’s go back to the hikari configuration of the project:

  • minimumIdle = 10 and maximumPoolSize = 50; idleTimeout and maxLifetime are not configured, so the defaults idleTimeout = 10 minutes and maxLifetime = 30 minutes apply.
  • That is, when the connection pool is full with 50 connections: if the system then goes idle, 40 connections are discarded after 10 minutes (exceeding idleTimeout); if the system stays busy, all 50 connections are discarded after 30 minutes (exceeding maxLifetime).

A guess at the root of the problem:

Each time a new database connection is created, it is put into the connectionPhantomRefs collection. When a connection's idle time exceeds idleTimeout or its lifetime exceeds maxLifetime, it is discarded and waits in connectionPhantomRefs to be reclaimed. Because connection objects generally live a long time, they survive many young GCs and are promoted to the old generation. Once the connection objects themselves are in the old generation, the elements in connectionPhantomRefs keep piling up until the next full GC. If connectionPhantomRefs has accumulated too many elements by the time the full GC runs, that full GC will be very time-consuming.
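To check this hypothesis on a running service, the size of connectionPhantomRefs can be read reflectively. This is a hedged sketch: the field name and type match the mysql 5.1.x driver source quoted above, and the method returns -1 when the driver is not on the classpath:

```java
import java.lang.reflect.Field;
import java.util.Map;

public class PhantomRefMonitor {

    // Reads NonRegisteringDriver.connectionPhantomRefs via reflection and
    // returns its size, or -1 if the mysql driver is not on the classpath.
    public static int phantomRefCount() {
        try {
            Class<?> driver = Class.forName("com.mysql.jdbc.NonRegisteringDriver");
            Field field = driver.getDeclaredField("connectionPhantomRefs");
            field.setAccessible(true);
            Map<?, ?> refs = (Map<?, ?>) field.get(null); // static field
            return refs.size();
        } catch (ReflectiveOperationException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("phantom refs tracked: " + phantomRefCount());
    }
}
```

Exposing this count through a metrics endpoint would let us watch the backlog grow between full GCs instead of waiting for a heap dump.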

So what's the solution? We can consider tuning minimumIdle, maximumPoolSize, idleTimeout, and maxLifetime. We will analyze this in the next section.

III. Problem verification

Simulating the online environment

To verify the problem, we need to simulate the online environment and adjust parameters such as maxLifetime.

  • 1. Simulate the online configuration of the cache system, then use the pressure-test system to load the cache system continuously for a period of time, so that it creates and discards a large number of database connections quickly. Observe whether NonRegisteringDriver objects pile up as expected, then call System.gc() manually to see whether they are cleaned up.
  • 2. Adjust the maxLifetime parameter and check whether NonRegisteringDriver objects still accumulate over the same pressure-test period.

Here are some points to note:

  • 1. NonRegisteringDriver objects enter the old generation only when (young GC interval) × (number of young GCs an object must survive before promotion, i.e. the tenuring threshold) < maxLifetime.
  • 2. Keep the production settings minimumIdle = 10 and maximumPoolSize = 50, and set maxLifetime to 100s (the GC interval is about 20s, so 100s is greater than 20s × 3 = 60s). Under continuous pressure testing this is expected to recycle and recreate connections roughly every 100 seconds (even though maximumPoolSize = 50, about 10 connections are enough to handle the test load).
  • 3. Reduce the project's memory allocation and lower the tenuring threshold, so that the NonRegisteringDriver objects in the young generation reach the old generation sooner and obvious object growth can be observed in a shorter time.
  • 4. Monitor the connection lifecycle of the cache system's connection pool, as well as the system's GC activity.

The final environment configuration is as follows:

Simulation results

  • Enable the JVisualVM tool to observe the cache system in real time
  • Enable hikari-related debug logs to view connection pool information

Set maxLifetime to 100s and start the cache system

Verify that the hikari and JVM configurations have taken effect

Start the pressure-test program and run it for 1000s

Observing the GC logs during the run: the GC interval was about 20 seconds, so about 5 GCs occurred every 100 seconds

Keep observing: after 1000s there should theoretically be 220 objects (20 + 20 × 1000s / 100s); see the JVisualVM screenshot below

Analysis of experimental results

Combining this with our production situation: assume each day has 14 peak hours (12:00 noon to 2:00 AM) with 20 connections and 10 off-peak hours with 10 connections, and that each full GC interval is 14 days. By the next full GC the accumulated NonRegisteringDriver objects number (20 × 14 + 10 × 10) × 2 × 14 = 10,640, which is on the same order as the 10,140 NonRegisteringDriver objects found in the dump.
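The two estimates above are simple arithmetic and can be checked directly. Note the factor of 2 in the production formula is not explained in the original derivation; one reading, assumed here, is two connection recycles per hour under the default 30-minute maxLifetime:

```java
public class GcObjectEstimate {

    // Pressure test: 20 initial objects plus 20 recycled connections
    // every 100s (maxLifetime) over a 1000s run.
    static int pressureTestEstimate() {
        return 20 + 20 * 1000 / 100;
    }

    // Production: 14 peak hours x 20 connections + 10 off-peak hours x 10
    // connections, times 2 (assumed: two recycles/hour at maxLifetime = 30 min),
    // accumulated over a 14-day full-GC interval.
    static int productionEstimate() {
        return (20 * 14 + 10 * 10) * 2 * 14;
    }

    public static void main(String[] args) {
        System.out.println(pressureTestEstimate() + " objects expected in the test, "
                + productionEstimate() + " in production");
    }
}
```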

At this point the root cause of the problem is fully confirmed!

IV. Solutions

As the above analysis shows, the problem is caused by the accumulation of discarded database connection objects, which makes full GC take a long time. We can therefore approach the solution from two directions:

  • 1. Reduce the generation and accumulation of abandoned data connection objects.
  • 2. Optimize the full GC time.

[Adjust hikari parameters]

We can consider setting maxLifetime to a larger value, extending the lifetime of each connection and reducing how often connections are discarded, so that fewer database connection objects need to be cleaned up by the next full GC.

Hikari recommends setting maxLifetime below the database's wait_timeout (the official docs say several seconds shorter than any database-imposed connection time limit; in practice 30 seconds to 1 minute less is common). If you are using MySQL, run show global variables like '%timeout%'; to check wait_timeout, which defaults to 8 hours.
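A configuration sketch along these lines, assuming HikariCP is on the classpath; the JDBC URL is a placeholder, not the project's real value:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSetup {

    public static HikariDataSource buildPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://localhost:3306/demo"); // placeholder URL
        config.setMinimumIdle(10);
        config.setMaximumPoolSize(50);
        // Keep maxLifetime well below MySQL's wait_timeout (8h by default);
        // 1 hour greatly reduces how often connections are discarded.
        config.setMaxLifetime(3_600_000L); // 1 hour, in milliseconds
        config.setIdleTimeout(600_000L);   // 10 minutes
        return new HikariDataSource(config);
    }
}
```

With maxLifetime = 1 hour instead of the 30-minute default, connections are recycled half as often, so the phantom-reference backlog grows at half the rate.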

To start the validation, set maxLifetime = 1 hour with other conditions unchanged. Observing JVisualVM before the pressure test starts, the number of NonRegisteringDriver objects is 20

With minimumIdle = 10, the number of pooled connections stayed between 10 and 20 during the test

[Using G1 Collector]

The G1 collector is one of the latest achievements of Java garbage collection: an excellent collector with low latency and high throughput that lets users set a target maximum pause time.

The next step is to verify the G1 collector's effectiveness, which requires a long period of observation with the help of the distributed tracing tool SkyWalking. After 10 days of observation, the results are shown below. Using the G1 collector, partial JVM parameters: -Xms3g -Xmx3g -XX:+UseG1GC

Using the Java 8 default Parallel GC collector combination, partial JVM parameters: -Xms3g -Xmx3g

  • 1. Heap memory, divided into used and free.
  • 2. Method area memory (nothing to worry about here).
  • 3. Young GC and full GC durations.
  • 4. The number of young GCs and full GCs since the program started.

We can see that the service using the Parallel GC combination consumes memory faster, with 6,996 young GCs and one full GC taking as long as 5s. The other service, using the G1 collector, consumed memory more evenly, with only 3,827 young GCs and no full GC at all. This shows that the G1 collector can indeed solve our database connection object pileup problem.

[Establish an inspection system]

We have not implemented this yet, but according to the analysis above, triggering a full GC periodically would clear a small number of accumulated database connection objects each time and prevent them from piling up. This approach requires familiarity with the business and its daily peaks and troughs. Possible implementations:

  • 1. Create a Java program that calls System.gc() periodically from a scheduled task. The downside is that even when System.gc() is called manually, the JVM may not start collecting immediately; it may choose what it considers the optimal time based on its own algorithms.
  • 2. Create a shell script that calls jmap -dump:live,file=dump_001.bin PID and schedule it with a Linux crontab entry. This method guarantees that a full GC occurs, but the drawback is that the function is too isolated and scattered, and not easy to manage centrally.
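A minimal sketch of option 1, using a scheduled executor; the 24-hour period and the initial delay are assumptions to be tuned to the business's off-peak window:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class GcPatrol {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Schedules a daily System.gc() call and returns the task handle.
    // System.gc() is only a hint: the JVM may defer the collection.
    public ScheduledFuture<?> start(long initialDelayHours) {
        return scheduler.scheduleAtFixedRate(
                System::gc, initialDelayHours, 24, TimeUnit.HOURS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```

As noted above, this carries the caveat that the GC request is advisory, so the jmap-based approach remains the only one that forces a collection.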

V. Summary

The root cause of our problem this time was the pileup of database connection objects, which made full GC take too long. The solution can start from the following three points:

  • 1. Adjust the hikari parameters: for example, set maxLifetime to a larger value (30 seconds to 1 minute less than wait_timeout), and do not set minimumIdle and maximumPoolSize too large (or simply use the defaults).
  • 2. Use the G1 garbage collector.
  • 3. Establish an inspection system and proactively trigger full GC during off-peak periods.

Personal public account

  • If you are keen to learn, you can follow my public account and study and discuss with me.
  • If you think anything in this article is incorrect, leave a comment, or follow my public account and message me privately; let's learn and improve together.
  • Github address: github.com/whx123/Java…