I. Event background

I operate and maintain an online real-time interface service, and recently it started throwing "Address already in use (Bind failed)" errors frequently.

The cause was a large number of TIME_WAIT connections that had not been released, exhausting the local ports (over 60,000 at the peak), so HttpClient could run into a port conflict when establishing a new connection.

The details are as follows:

TIME_WAIT characteristics:

To address the TIME_WAIT problem, I did some research and concluded that a connection pool could keep TCP connections alive, reducing the number of ephemeral ports HttpClient opens under concurrency and reusing existing valid connections. But the connection pool setup introduced new problems of its own.

II. Problem process

To estimate the maximum number of connections for the pool, I used the peak business figures: 12,000 PV of requests per minute and an average interface response time of 1.3 s (the service is a complex simulation system for advertising promotion effects, so the high average response time is inherent to the business).

That gives a peak concurrency of 12000 × 1.3 / 60 = 260 in-flight requests (request rate times response time). From the business logs, each connection takes about 1.1 s to establish, and I left 70%+ headroom (for fear of system failure from setting the connection count too low), so the maximum number of connections was estimated at 260 × 1.1 × 1.7 ≈ 500.
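To make the arithmetic above easy to check, here is a minimal sketch of the same estimate (the 1.7 factor is the ~70% headroom mentioned above; the numbers are the ones from this article, not a general formula):

```java
public class PoolSizeEstimate {
    public static void main(String[] args) {
        double requestsPerMinute = 12_000;   // peak PV per minute
        double avgResponseSeconds = 1.3;     // peak average response time
        // In-flight requests (Little's law): request rate * response time
        double concurrency = requestsPerMinute / 60 * avgResponseSeconds;   // 260
        double connectSeconds = 1.1;         // observed time to establish a connection
        double headroom = 1.7;               // ~70% safety margin
        double maxConnections = concurrency * connectSeconds * headroom;    // ~486
        System.out.printf("concurrency=%.0f, maxConnections=%.0f%n",
                concurrency, maxConnections);
    }
}
```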

To minimize changes to the business code and get the optimization verified online quickly, I stayed with HttpClient 3.1's MultiThreadedHttpConnectionManager. After writing multithreaded test cases offline and confirming that throughput was indeed higher than without the pool, we first verified the change with a small amount of traffic in our Nanjing data center, where the results also met expectations. We then rolled it out to the whole Beijing data center, and after the full rollout an unexpected system anomaly appeared…
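For reference, the change essentially amounted to sharing one MultiThreadedHttpConnectionManager across requests. A minimal sketch of that kind of HttpClient 3.1 setup (illustrative only, not the actual production code; the value 500 comes from the estimate above) looks like this, and note that only the total connection limit is configured:

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

public class PooledHttpClientFactory {
    public static HttpClient create() {
        MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
        HttpConnectionManagerParams params = manager.getParams();
        params.setMaxTotalConnections(500); // overall cap from the estimate above
        // No per-host limit is set here -- exactly the trap uncovered later.
        return new HttpClient(manager);
    }
}
```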

III. Case review

On the evening after traffic was fully switched over, everything looked as expected, but the next morning users and the related operations group reported that the live page would not open. I was on the road at the time, so I asked the person on duty to take a first look; they found that the most time-consuming part was the call to the backend service through the connection pool, so the troubleshooting direction for this sudden problem could roughly be set as a failure related to the connection pool change.

So when I got there, the first thing I did was look at the overall application:

  • The service traffic on the monitoring platform was normal, but the network card traffic on some machines rose slightly
  • The average response time of the interface increased significantly
  • The business logs showed no obvious exceptions and no timeouts from the underlying service, so the rise in average response time was clearly not caused by the business logic itself
  • 9 of the 30 machine instances were found to be down: 6 in Beijing and 3 in Nanjing

IV. In-depth investigation

Nearly a third of the instance processes had crashed while overall business traffic stayed the same; since the RPC framework load-balances traffic across the surviving providers, per-machine traffic increased, making the remaining instances even more likely to crash. So the key question was why the processes had died in the first place.

Since the change was switching HttpClient to pooled connections, the most likely impact would show up in thread and CPU state, so we immediately checked whether the thread count and CPU usage were healthy.

  1. State of the CPU

CPU status:

As the graph shows, the Java processes were using up to 10 times more CPU than usual.

  2. Thread count monitoring status:



The graph shows that on many machines the thread count spiked at around 10 a.m., even exceeding the virtualization platform's limit of 2,000 threads per container (the platform enforces this circuit-breaker to prevent a container with too many threads from taking down the whole host machine), so it was the virtualization platform that killed those instances. The reason the thread count did not exceed the limit during the small-traffic rollout in the Nanjing data center is presumably that Nanjing's traffic is only about one third of Beijing's.

The next step was to analyze why threads piled up so quickly that they exceeded the limit. My suspicion at this point was that the maximum number of connections configured on the pool was limiting the concurrency of the threads acquiring connections. To troubleshoot this properly, I rolled back part of the online instances and then compared the TCP connections of an instance still running the new code with those of a rolled-back instance.

TCP connections before rollback:

TCP connections after rollback:

After the rollback the number of concurrent connection threads was much smaller. The next step was to confirm whether the connection pool settings were responsible, so I ran jstack on a machine that had not been rolled back and analyzed the threads spawned inside the Java process to confirm the problem.

jstack output:

From the jstack output it is easy to see a large number of threads queued up waiting to get a connection from the pool, which causes threads to pile up and the average response time to rise. The more threads pile up, the more system resources they consume and the slower the interface becomes, which in turn makes even more threads pile up. This creates a vicious circle that easily pushes the thread count over the limit.

So why was the effective concurrency so small? I had already left 70% headroom in the estimate, so something else had to be wrong!

So I read and analyzed the source code and found the clue:

The source code

As the MultiThreadedHttpConnectionManager source shows, when doGetConnection is called to allocate a connection, it not only checks whether the maxTotalConnections parameter I set has been exceeded, it also checks whether maxHostConnections has been exceeded.
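In simplified form, the check described above amounts to two limits consulted on every allocation, roughly as in this sketch (a paraphrase for illustration, not the actual library source):

```java
// Illustration of the two limits consulted when a thread asks the pool for a connection.
public class ConnectionLimitSketch {
    static final int MAX_TOTAL = 500;    // maxTotalConnections, set explicitly
    static final int MAX_PER_HOST = 2;   // maxHostConnections default in HttpClient 3.1

    static boolean canAllocate(int totalInUse, int inUseForThisHost) {
        // A new connection is handed out only if BOTH limits still have room;
        // otherwise the requesting thread waits for a connection to be released.
        return totalInUse < MAX_TOTAL && inUseForThisHost < MAX_PER_HOST;
    }

    public static void main(String[] args) {
        System.out.println(canAllocate(2, 2));   // false: per-host limit already reached
        System.out.println(canAllocate(100, 1)); // true: both limits still have room
    }
}
```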

So I immediately looked up what maxHostConnections means: it is the maximum number of connections allowed per host route, which must be configured via setDefaultMaxConnectionsPerHost; otherwise it defaults to 2.

So it was not that I had miscalculated the maximum number of connections for the business. Because I did not know I had to call setDefaultMaxConnectionsPerHost, each requested host was limited to 2 concurrent connections, which throttled the concurrency of the threads acquiring connections (no wonder the TCP view earlier showed only 2 connections 😃).
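The fix is therefore to raise the per-host limit alongside the total limit. A sketch of the corrected configuration (the concrete values are illustrative; here the per-host limit simply matches the total cap):

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

public class FixedPooledHttpClientFactory {
    public static HttpClient create() {
        MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
        HttpConnectionManagerParams params = manager.getParams();
        params.setMaxTotalConnections(500);
        // Without this call the default of 2 connections per host applies
        // and throttles the threads acquiring connections.
        params.setDefaultMaxConnectionsPerHost(500);
        return new HttpClient(manager);
    }
}
```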

V. Summary of the case

Now that the fundamental problem of this avalanche incident has been thoroughly identified, let’s briefly summarize the whole process of this case again:

  1. The connection pool was configured with a wrong parameter, so the effective maximum number of connections per host was 2
  2. A large number of request threads had to wait for the pool to release connections, so they queued up and piled up
  3. The more threads piled up, the higher the interface's average response time and the more system resources were consumed, which further increased latency and thread pile-up
  4. Eventually the thread count exceeded the limit and the instances were killed by the virtualization platform
  5. Once some instances went down, their traffic was diverted to the surviving instances, which increased the pressure on them and made an avalanche easy to trigger

As for how to optimize and prevent this kind of problem from happening again, I can think of three approaches:

  1. Before adopting a new technology, read the official documentation carefully and try not to skip any detail
  2. Look at solid open source projects to see how others use the same technology; for example, search GitHub for the relevant keywords and pick high-quality projects as references
  3. Run load tests offline first, using controlled comparisons of the different settings, so that problems are exposed in advance and you can go online with confidence

The following is a pressure test scheme I designed:

A. Measure the peak QPS the system can sustain and the change in thread count, with and without the connection pool

B. Compare the sustainable peak QPS and the change in thread count with and without setDefaultMaxConnectionsPerHost configured

C. Compare the sustainable peak QPS and the change in thread count while adjusting the setMaxTotalConnections and setDefaultMaxConnectionsPerHost thresholds

D. During the tests, monitor the instance's thread count, CPU utilization, TCP connections, port usage, and memory utilization
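As a rough starting point for scenarios A–C, a minimal load-test sketch might look like the following (the target URL, thread count, and request counts are placeholders; a real test would also record failures separately and sample the thread/CPU/TCP/port/memory metrics from item D externally):

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.methods.GetMethod;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class PoolLoadTest {
    public static void main(String[] args) throws Exception {
        // Pool settings are the variables under comparison in scenarios B and C.
        MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
        manager.getParams().setMaxTotalConnections(500);
        manager.getParams().setDefaultMaxConnectionsPerHost(500);
        HttpClient client = new HttpClient(manager);

        int threads = 200;                                       // concurrency under test
        int requestsPerThread = 50;
        String url = "http://backend.example.com/simulate";      // hypothetical target

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong totalMillis = new AtomicLong();
        CountDownLatch done = new CountDownLatch(threads);
        long start = System.currentTimeMillis();
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    for (int j = 0; j < requestsPerThread; j++) {
                        GetMethod get = new GetMethod(url);
                        long t0 = System.currentTimeMillis();
                        try {
                            client.executeMethod(get);
                            get.getResponseBodyAsString();
                        } catch (Exception e) {
                            // A real test would count failures separately.
                        } finally {
                            get.releaseConnection();
                            totalMillis.addAndGet(System.currentTimeMillis() - t0);
                        }
                    }
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        long wall = System.currentTimeMillis() - start;
        long requests = (long) threads * requestsPerThread;
        System.out.printf("QPS=%.1f, avg latency=%.1f ms, JVM threads now=%d%n",
                requests * 1000.0 / wall,
                totalMillis.get() / (double) requests,
                Thread.activeCount());
        pool.shutdown();
    }
}
```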

To sum up, this avalanche caused by a single connection pool parameter has been analyzed, located, and fully resolved. We should be careful with every technology point we upgrade during a technical change, and when problems do occur, focus on analyzing their characteristics and patterns and look for what they have in common to find the root cause.