I. Event background

Recently, a real-time interface service that I operate and maintain on the Internet began frequently reporting the error Address already in use (Bind failed).

Obviously this is a port-binding conflict, so I checked the system's network connections and port usage and found that a large number of connections in the TIME_WAIT state were occupying ports, exhausting the available local ports (over 60,000 connections at peak). As a result, HttpClient hit port conflicts when establishing new connections.

The details are as follows:

To solve the TIME_WAIT problem, after searching online and thinking it over, I decided to use a connection pool to hold TCP connections, reducing the number of ephemeral ports that HttpClient opens concurrently and reusing existing valid connections. However, the connection pool settings introduced a new problem.

II. Problem process

When estimating the maximum number of connections for the pool, I used the peak request volume of 12,000 PV per minute and the interface's average response time of 1.3s (this is a complex advertising-effect simulation system, and the high average response time is inherent to the business).

Therefore, the peak QPS is 12000 / 60 = 200, and by Little's law the average number of in-flight requests is 200 × 1.3 = 260.

According to the service logs, each connection takes about 1.1s to set up, and I wanted to leave 70%+ headroom (to avoid failures from sizing the pool too small), so the maximum number of connections was estimated as 260 × 1.1 × 1.7 ≈ 500.
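Restated as a worked formula (Little's law gives the in-flight requests; the 1.1 and 1.7 factors are the safety margins described above):

```latex
\begin{align*}
\lambda &= \tfrac{12000}{60\,\text{s}} = 200\ \text{QPS} \\
L &= \lambda \cdot W = 200 \times 1.3\,\text{s} = 260\ \text{concurrent requests} \\
\text{maxTotalConnections} &\approx 260 \times 1.1 \times 1.7 \approx 486 \approx 500
\end{align*}
```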

To minimize changes to the business code before going live and to verify the optimization quickly, I stuck with HttpClient 3.1's MultiThreadedHttpConnectionManager. I then wrote multithreaded test cases offline and confirmed that concurrency was indeed higher than without the pool. We first rolled the change out with small traffic in our Nanjing data center to verify the effect; after it met expectations, we proceeded to roll it out across the entire Beijing data center. Then an unexpected system anomaly appeared…
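For reference, the pool was presumably configured roughly like the sketch below (a reconstruction based on the description above, not the actual production code; the timeouts are illustrative). Note that setDefaultMaxConnectionsPerHost is not set, which becomes important later:

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

public class PooledHttpClientHolder {
    // Shared pooled client built on HttpClient 3.1 (a sketch, not the author's code).
    public static final HttpClient CLIENT = build();

    private static HttpClient build() {
        MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
        HttpConnectionManagerParams params = manager.getParams();
        params.setMaxTotalConnections(500);  // the pool-wide cap estimated above
        // setDefaultMaxConnectionsPerHost is NOT set here, so each host route
        // stays at the library default of 2 connections.
        params.setConnectionTimeout(1000);   // illustrative values, not from the post
        params.setSoTimeout(3000);
        return new HttpClient(manager);
    }
}
```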

III. Case review

On the evening of the rollout, traffic behaved as expected. But the next morning, users in the user group and the related operations group reported that pages would not open. I was on my way to the office at the time, so I asked the on-call engineer to take a quick look; he found that the most time-consuming part was the call to the back-end service through the connection pool, so the troubleshooting of this sudden problem could roughly center on the connection pool.

So the first thing I looked at when I got to the company was the overall state of the application:

  • Service traffic on the monitoring platform was normal, but the NIC traffic of some machines spiked
  • The interface's average response time rose significantly
  • There were no obvious exceptions in the service logs and no timeouts from the underlying services, so the rise in average response time was not caused by the downstream service itself
  • Nine of the 30 machine instances were found to be dead: six in Beijing and three in Nanjing

IV. In-depth investigation

Nearly 1/3 of the instance processes had crashed, yet total service traffic was unchanged. Because the RPC framework balances traffic across the surviving providers, the traffic per machine increased, making the remaining instances even more likely to crash.

Since the problem was most likely caused by switching HttpClient over to the pooled connection mode, and the most likely things such a change would affect are thread and CPU state, I immediately checked whether the thread counts and CPU state were normal.

1. CPU status

As you can see, the CPU usage of the Java processes is very high, nearly 10 times the normal level.

2. Thread count monitoring status

From the graph you can see that at around 10 a.m. the thread count on many machines soared, even exceeding the virtualization platform's limit of 2,000 threads per container (the platform enforces this circuit-breaker to prevent a container with too many threads from dragging down the whole host machine), so it was the virtualization platform that killed those instances. Why didn't the thread count exceed the limit during the earlier small-traffic rollout in the Nanjing data center? Presumably because Nanjing's traffic is only about 1/3 of Beijing's.

The next step was to analyze why the thread count accumulated so quickly that it exceeded the limit. At this point I suspected that the maximum number of connections set on the connection pool was limiting the concurrency of the threads acquiring connections. To troubleshoot more effectively, I rolled back some of the instances online and compared the TCP connections of an instance before and after the rollback.

TCP connections before the rollback:

TCP connection status after rollback:

The concurrency of connection threads was indeed much lower after the rollback. To confirm whether the connection pool settings were the cause, I ran jstack on a machine that had not been rolled back and analyzed the threads inside the Java process.

jstack output:

From the jstack logs it is easy to see that a large number of threads were queued up waiting to obtain a connection from the connection pool, causing threads to pile up and the average response time to rise. The more threads pile up, the more system resources they occupy and the higher the interface's response time climbs, which in turn piles up even more threads. This easily becomes a vicious cycle that pushes the thread count over the limit.

So why was the effective concurrency so small? I had left 70%+ headroom when estimating it, so something else had to be wrong!

So I read through the source code for analysis and found the clue:

As can be seen in the MultiThreadedHttpConnectionManager source, when the connection pool's doGetConnection method is called to allocate a connection, it checks not only whether the maxTotalConnections parameter I set has been exceeded, but also whether maxHostConnections has been exceeded.
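Conceptually, the allocation gate works like the simplified sketch below. This is not the real HttpClient 3.1 source, just an illustration of why a small per-host limit caps concurrency no matter how large the total limit is:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified illustration of the gating described above: a new connection is
// created only when BOTH the per-host limit and the total limit allow it.
// All names here are hypothetical; this is not MultiThreadedHttpConnectionManager code.
class PerHostGateSketch {
    private final int maxTotalConnections;  // e.g. 500
    private final int maxHostConnections;   // defaults to 2 in HttpClient 3.1
    private int totalConnections = 0;
    private int hostConnections = 0;        // connections to the single host we call
    private final Deque<Object> freeConnections = new ArrayDeque<>();

    PerHostGateSketch(int maxTotal, int maxPerHost) {
        this.maxTotalConnections = maxTotal;
        this.maxHostConnections = maxPerHost;
    }

    synchronized Object getConnection() throws InterruptedException {
        while (true) {
            if (!freeConnections.isEmpty()) {
                return freeConnections.pop();              // reuse an idle connection
            }
            if (hostConnections < maxHostConnections       // per-host cap (the culprit)
                    && totalConnections < maxTotalConnections) {
                hostConnections++;
                totalConnections++;
                return new Object();                       // stand-in for a new connection
            }
            wait();  // both gates closed: the calling thread queues up here
        }
    }

    synchronized void releaseConnection(Object connection) {
        freeConnections.push(connection);
        notifyAll();  // wake a queued thread so it can reuse the freed connection
    }
}
```

With the per-host limit left at 2 and a single back-end host, at most two threads can hold connections at any moment; every other caller blocks in wait(), which matches the thread pile-up seen in the jstack output.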

So I immediately looked up the meaning of maxHostConnections: it is the default maximum number of connections per host route, which must be set via setDefaultMaxConnectionsPerHost; otherwise it defaults to 2.

So it was not that I had miscalculated the maximum number of connections for the business; it was that I did not know I had to set DefaultMaxConnectionsPerHost, which left each requested host with only 2 concurrent connections, severely limiting the concurrency of threads acquiring connections (no wonder only two established connections showed up when I looked at the TCP concurrency 😃).
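The fix, then, is to raise the per-host limit alongside the total limit. A minimal sketch of the corrected configuration (the value 500 is illustrative, chosen because this service essentially calls a single back-end host):

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

public class FixedPooledHttpClient {
    public static HttpClient create() {
        MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
        HttpConnectionManagerParams params = manager.getParams();
        params.setMaxTotalConnections(500);
        // The crucial line: without it, each host route is capped at the default
        // of 2 connections, regardless of maxTotalConnections.
        params.setDefaultMaxConnectionsPerHost(500);
        return new HttpClient(manager);
    }
}
```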

V. Summary of the case

Now that the root cause of this avalanche has been thoroughly identified, let's walk through the whole case once more:

  1. The effective maximum number of connections per host was 2
  2. A large number of request threads had to wait for the connection pool to release connections, so they queued up and accumulated
  3. As the accumulated threads increased, the interface's average response time rose, occupying more system resources and further increasing interface time and thread accumulation
  4. Eventually the thread count exceeded the limit and the instances were killed by the virtualization platform
  5. With some instances down, their traffic shifted to the surviving instances, raising the pressure on them and easily triggering an avalanche

As for the optimization plan and how to prevent such problems from recurring, I came up with three measures:

  1. Before upgrading a technology, read the relevant official documentation carefully and try not to miss any details
  2. Look for reliable open-source projects online and see how other well-regarded projects use the technology. On GitHub, for example, you can search by technology keyword to find open-source projects that use the same stack; take care to pick high-quality projects as references
  3. Run load tests first, using the control-variable method to compare the various settings under different conditions, so that problems are exposed offline in advance and you can go live with confidence

The following is the load-test plan I designed (a minimal harness sketch follows the list):

A. Compare using vs. not using the connection pool; analyze the peak QPS the service can withstand and the change in thread count

B. Compare setting vs. not setting setDefaultMaxConnectionsPerHost; analyze the peak QPS the service can withstand and the change in thread count

C. Compare different thresholds for setMaxTotalConnections and setDefaultMaxConnectionsPerHost; analyze the peak QPS the service can withstand and the change in thread count

D. During the tests, pay attention to the instances' thread count, CPU usage, TCP connection count, port usage, and memory usage
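A minimal harness for these comparisons might look like the sketch below. It reuses the hypothetical FixedPooledHttpClient factory from the earlier sketch and a placeholder target URL; it is an illustration of the control-variable approach, not the author's actual pressure-test code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class PoolSettingsLoadTest {
    // Placeholder endpoint; replace with the real back-end URL.
    private static final String TARGET_URL = "http://backend.example.com/api";

    public static void main(String[] args) throws Exception {
        // Build the client with the settings under test; vary only one of
        // maxTotalConnections / defaultMaxConnectionsPerHost per run.
        HttpClient client = FixedPooledHttpClient.create();

        int concurrency = 200;        // simulated concurrent callers
        int requestsPerThread = 50;
        AtomicLong success = new AtomicLong();
        AtomicLong failure = new AtomicLong();

        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        long start = System.currentTimeMillis();
        for (int i = 0; i < concurrency; i++) {
            pool.submit(() -> {
                for (int j = 0; j < requestsPerThread; j++) {
                    GetMethod get = new GetMethod(TARGET_URL);
                    try {
                        int status = client.executeMethod(get);
                        (status == 200 ? success : failure).incrementAndGet();
                    } catch (Exception e) {
                        failure.incrementAndGet();
                    } finally {
                        get.releaseConnection();  // always return the connection to the pool
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);

        long elapsedMs = System.currentTimeMillis() - start;
        long total = success.get() + failure.get();
        System.out.printf("total=%d success=%d failure=%d approxQPS=%.1f liveThreads=%d%n",
                total, success.get(), failure.get(),
                total * 1000.0 / elapsedMs, Thread.activeCount());
    }
}
```

Running the same driver with the pool disabled, with and without setDefaultMaxConnectionsPerHost, and with different thresholds covers scenarios A through C; thread count, CPU, TCP connections, ports, and memory (scenario D) can be watched with the usual system tools while it runs.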

To sum up, this avalanche caused by a single connection-pool parameter went from analysis to root cause to resolution. When making technical changes we should be careful with every upgraded technology point, and when a problem does appear we should focus on analyzing its characteristics and patterns, looking for commonalities to locate the root cause.

Link: https://blog.csdn.net/qq_1668…

Copyright notice: This article was written by CSDN blogger "ZXCODEStudy" and is reproduced under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when republishing.
