Origin


A few days ago our team was on rotation, taking turns looking after our services: mainly handling alarm emails, investigating bugs, and dealing with operational issues. Weekdays are fine, since you have to be at work anyway, but a weekend on call is pretty much ruined.

I don't know whether it's because the company's network is so sprawling or because the network operations team isn't up to the job, but there are always network problems: a switch drops off the network here, a router breaks there, and every so often there are assorted timeouts. Our probe service, however, is extremely sensitive and always manages to catch these occasional little problems precisely, adding spice to otherwise pleasant work. More than once, the folks on the on-call rotation joked about how to defeat the service's liveness mechanism and quietly stop the probe service without anyone noticing (not that we dared to).

The other weekend I got to handle one of these messes blamed on the probe service.

Feel free to repost, but please credit the source: zhenbianshu.github.io. This article is still being revised.

The problem


Network problems?

Starting at around 7:00 PM, I began receiving a steady stream of alarm emails showing that several probed interfaces had timed out. Most of the execution stacks were in:

java.io.BufferedReader.readLine(BufferedReader.java:371)
java.io.BufferedReader.readLine(BufferedReader.java:389)
java_io_BufferedReader$readLine.call(Unknown Source)
com.domain.detect.http.HttpClient.getResponse(HttpClient.groovy:122)
com.domain.detect.http.HttpClient.this$2$getResponse(HttpClient.groovy)

We set the HTTP DNS timeout to 1s, the connect timeout to 2s, and the read timeout to 3s. These errors mean the probe service sent the HTTP request normally and the server handled it normally, but packets were lost somewhere along the network path, so the requesting thread's execution stack stays at the point where it reads the interface's response. The typical signature of this situation is that the corresponding log entry can be found on the server, and that log shows the server responded completely normally. By contrast, when the thread stack stays at Socket connect, the connection failed while being established and the server is entirely unaware of it.
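For context, here is a rough sketch of that kind of timeout configuration using plain java.net.HttpURLConnection. The real probe client is a Groovy HttpClient class that isn't shown in this article, and the DNS timeout is handled by the HTTP DNS layer rather than by this API, so treat this only as an illustration of where each timeout applies.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ProbeHttpSketch {
    // Hypothetical helper mirroring HttpClient.getResponse from the stacks above.
    static String getResponse(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(2_000); // 2s to establish the TCP connection
        conn.setReadTimeout(3_000);    // 3s of silence allowed while reading the response
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            // If response packets are lost after the connection succeeds, the thread
            // sits here in readLine() until the read timeout fires, matching the stacks.
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }
}

A failure during the connection stage would instead leave the stack inside the socket connect call, which is the other pattern described above.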

I noticed that one interface had a much higher error rate than the rest. This interface uploads a 4 MB file to the server, runs it through a series of business logic, and then returns about 2 MB of text data, whereas the other interfaces involve only simple business logic. My guess was that with so much data to upload and download, packet loss, and therefore timeouts, was simply more likely for it.

Working from this hypothesis, I went onto the server and searched the recent service logs with the request_id of a failing request, and sure enough, the interface had timed out because of network packet loss.

Of course, an explanation like this won't satisfy the leader on its own; someone has to own the conclusion. So I immediately contacted the operations and network teams to confirm the state of the network. The network team replied that the switch in the machine room hosting the probe service was old, had an unknown forwarding bottleneck, and was being optimized. That put me more at ease, so I posted a brief explanation in the department group chat and considered the job done.

The problems

I thought this shift would amount to just that one small ripple, but a little after eight in the evening alarm emails of every kind came flooding in, catching me completely off guard just as I was packing up for a quiet Sunday to myself.

This time almost all the interfaces were timing out, and our interface with heavy network I/O was timing out on every single probe. Was the whole machine room having problems?

Going through the servers and monitoring once more, I found that the metrics of every interface were normal, and the interfaces were completely fine when I tested them myself. Since the online services were not affected, I planned to stop the probe tasks through the probe service's own interface and then troubleshoot at a slower pace.

It was only then that I realized it wasn’t so simple.

The solution


A memory leak

So I quickly logged in to the probe server and ran the usual top, free, df trio, and sure enough, found some anomalies.

Our probe process had a particularly high CPU usage of 900%.

Our Java process doesn't do much heavy CPU computation; normally its CPU usage should sit between 100% and 200%. A CPU surge like this means the process has either entered an infinite loop or is doing a lot of GC.

I ran jstat -gc pid [interval] to check the GC status of the Java process, and GC was indeed running abnormally frequently.

Then I dumped the heap with jmap -dump:format=b,file=heap.log pid, restarted the probe service, and the alarm emails finally stopped.

jstat

jstat [-options] pid interval

It supports the following options:

  • -class Displays class loading information
  • -compiler Displays JIT compilation statistics
  • -gc Displays garbage collection information
  • -gcxxx Displays detailed GC information for a specific region, such as -gcold

It is very helpful for locating memory problems in the JVM.

Investigation


The problem was solved, but in order to prevent it from happening again, the root cause had to be identified.

Analysis of the stack

Stack analysis is straightforward: first check whether the total number of threads is excessive, then see what most of the stacks are doing.

> grep 'java.lang.Thread.State' jstack.log  | wc -l
> 464

A bit over four hundred threads, and nothing abnormal.

> grep -A 1 'java.lang.Thread.State' jstack.log | grep -v 'java.lang.Thread.State' | sort | uniq -c | sort -n
     10 at java.lang.Class.forName0(Native Method)
     10 at java.lang.Object.wait(Native Method)
     16 at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
     44 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    344 at sun.misc.Unsafe.park(Native Method)

The thread states also looked normal, so next I analyzed the heap file.

Download the heap dump file.

The heap dump is binary data that is very awkward to inspect on the command line, and Java's analysis tools are all graphical and can't be run on the Linux server, so the first step is to download the file to a local machine.

Since we had set the heap to 4 GB, the dumped heap file is quite large, and downloading it takes real effort, but we can compress it first.

gzip is a very capable compression command. In particular, we can pass -1 through -9 to choose the compression level: the higher the number, the better the compression ratio and the longer it takes. I recommend -6 or -7; -9 is simply too slow for the little extra it saves, and in the time it spends compressing you could have downloaded the extra bytes anyway.

Analyze the JVM heap using MAT

MAT is a great tool for analyzing Java heap memory. Open the heap file with it (rename the file suffix to .hprof first) and it will ask what kind of analysis to run. For this case, choose the Leak Suspects report.

As the pie chart in the report made clear, the vast majority of the heap was occupied by the same group of objects. Drilling into the heap details and tracing the references back up a level, I soon found the culprit.

Analysis of the code

Once the leaking object was identified, I searched the project globally for its class name. It turned out to be a Bean object, and the leak traced to one of its properties, a Map.

This Map stored the response result of each probe, grouped by type, in an ArrayList so it could be analyzed later. After every probe the result was appended to the corresponding ArrayList, but the Bean object was never reclaimed and there was no logic to clear this property. Over the ten days the service had gone without a restart, the Map grew larger and larger until it filled up the heap.
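To make the pattern concrete, here is a minimal sketch of this kind of leak. The class and field names are invented for illustration; the real code is a Groovy Bean in the probe project.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the real Bean; it stays referenced for the service's whole lifetime.
public class ProbeResultHolder {
    // Keyed by probe type; each list only ever grows and nothing ever clears it.
    private final Map<String, List<String>> resultsByType = new HashMap<>();

    public void record(String type, String responseBody) {
        resultsByType.computeIfAbsent(type, k -> new ArrayList<>()).add(responseBody);
    }
}

Because the holder itself never becomes unreachable, neither does anything inside the Map, so the garbage collector can never reclaim the accumulated response bodies.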

Once memory was full, no more memory could be allocated for HTTP response data, so the threads got stuck at readLine. And our heavy-I/O interface raised far more alarms than the others, presumably because its larger responses needed that much more memory.

A PR was sent to the code owner, and the problem was solved satisfactorily.
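The fix itself isn't shown in this article, but a plausible shape for it, assuming each result only needs to survive until the next analysis pass, is to hand the accumulated results to the analysis step and then clear the map. A sketch, continuing the hypothetical names above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProbeResultHolderFixed {
    private final Map<String, List<String>> resultsByType = new HashMap<>();

    public synchronized void record(String type, String responseBody) {
        resultsByType.computeIfAbsent(type, k -> new ArrayList<>()).add(responseBody);
    }

    // Hypothetical fix: the analysis step takes a snapshot and the map is cleared,
    // so old response bodies become unreachable and can be garbage collected.
    public synchronized Map<String, List<String>> drainForAnalysis() {
        Map<String, List<String>> batch = new HashMap<>(resultsByType);
        resultsByType.clear();
        return batch;
    }
}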

Summary


In fact, I should reflect on my own handling of this, because the very first alarm emails already contained this thread stack:

groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:166)
groovy.json.internal.JsonParserCharArray.decodeJsonObject(JsonParserCharArray.java:132)
groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:186)
groovy.json.internal.JsonParserCharArray.decodeJsonObject(JsonParserCharArray.java:132)
groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:186)

When I saw this stack in the errors I didn't think it through. TCP guarantees message integrity, and a value isn't assigned to the variable until the message has been received in full, so this was clearly an internal error; if I had examined it carefully I could have caught the problem earlier. When you're tracking down a problem, every link in the chain really does matter.

If you have any questions about this article, please leave a comment below. If you find this article helpful, please follow me on Weibo or GitHub.