Those of you who have read Refactoring: Improving the Design of Existing Code are familiar with the idea of code smells. What counts as a "bad smell" is subjective, of course, and varies by language, developer, and development method. In my work I often maintain an existing project and add new features on top of it, and to keep the code easy to understand and maintain I always watch out for those smells; when the code starts to smell, I refactor it promptly into good, clean code. Today we are going to talk about the impact of bad code on system performance. I will walk through a few cases that I hope will be instructive and helpful.

FGC in practice: analyzing an unresponsive service caused by bad code triggering frequent full GC

The problem

Network problems?

At 7:00 PM I began to receive a constant stream of alarm emails showing that several of the probed interfaces had timed out. Most of the execution stacks were in:

java.io.BufferedReader.readLine(BufferedReader.java:371)
java.io.BufferedReader.readLine(BufferedReader.java:389)
java_io_BufferedReader$readLine.call(Unknown Source)
com.domain.detect.http.HttpClient.getResponse(HttpClient.groovy:122)
com.domain.detect.http.HttpClient.this$2$getResponse(HttpClient.groovy)

We set the HTTP DNS timeout to 1 s, the connect timeout to 2 s, and the read timeout to 3 s. These errors mean that the probe service sent the HTTP request normally and the server handled it normally, but packets were lost somewhere in the network layers, so the requesting thread's execution stack is stuck where it waits for the interface response. The typical signature of this situation is that the corresponding log records can be found on the server, and those logs show the server responded completely normally. By contrast, when the thread stack stops at the Socket connect, the connection itself failed to be established and the server is completely unaware of the request.
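
The probe's client is a hand-rolled Groovy HttpClient, so purely as an illustration of those settings (not the actual project code), here is how the connect and read timeouts map onto plain HttpURLConnection; the DNS timeout is an HTTPDNS-level setting and is not shown:

import java.net.HttpURLConnection;
import java.net.URL;

// Illustration only, not the probe's real client code.
public class ProbeTimeoutsSketch {
    static HttpURLConnection open(String endpoint) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setConnectTimeout(2_000); // connect timeout: 2 s
        conn.setReadTimeout(3_000);    // read timeout: 3 s; readLine() fails here when response packets are lost
        return conn;
    }
}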

I noticed that one interface failed much more often than the others. It uploads a 4 MB file to the server, runs through a series of business logic, and then returns about 2 MB of text data, while the rest of the interfaces have simple business logic. My guess was that, because it uploads and downloads so much data, a timeout caused by packet loss was much more likely.

Following this hypothesis, we went to the server and searched the recent service logs with the request_id of the failing request, and sure enough, the interface had timed out because of network packet loss.

Of course, the leader would not be satisfied with just that; the conclusion had to be owned by someone. So I immediately contacted the operations and network teams to confirm the state of the network. The network group replied that the switch in the machine room hosting the probe service was old, had an unidentified forwarding bottleneck, and was being optimized. That made me feel more at ease, so I gave a brief explanation in the department group chat and considered the task done.

The problem escalates

I thought this shift would end with that small ripple, but after eight o'clock in the evening all kinds of alarm emails flooded in, catching me unprepared just as I was packing up to enjoy my one day off on Sunday.

This time almost all the interfaces were timing out, and our interfaces with heavy network I/O were timing out every single time. Was the whole machine room at fault?

Checking the servers and the monitoring again, I found that the metrics of each interface were normal, and the interfaces worked fine when I tested them myself. Since the online service was not affected, I planned to stop the probing tasks through the probe service's own interface and then troubleshoot slowly.

It was only then that I realized it wasn’t so simple.

Mitigation

A memory leak? I quickly logged in to the probe server and ran the usual trio of top, free, and df, and indeed found some anomalies.

Our probe process had a particularly high CPU usage of 900%.

Our Java process does not do much CPU-intensive work; normally CPU usage should be between 100% and 200%. A surge like this means the process is either stuck in an infinite loop or doing a lot of GC.

I used jstat -gc pid [interval] to check the GC status of the Java process.

I then dumped the heap with jmap -dump:format=b,file=heap.log pid for later analysis. After that, the detection service was restarted, and the alarm emails finally stopped.

jstat

jstat [-options] pid [interval]

It supports the following viewing items:

  • class — class loading information
  • compile — JIT compilation statistics
  • gc — garbage collection information
  • gcXXX — GC details for a specific region, such as -gcold

It is very helpful for locating memory problems in the JVM.

Troubleshooting

The immediate symptom was gone, but to prevent it from recurring, the root cause still had to be found.

Analyzing the thread stacks, the first easy checks are whether there are too many threads and what most of the stacks are doing.

> grep 'java.lang.Thread.State' jstack.log  | wc -l
> 464

A bit over four hundred threads, nothing abnormal.

> grep -A 1 'java.lang.Thread.State' jstack.log | grep -v 'java.lang.Thread.State' | sort | uniq -c | sort -n
     10 at java.lang.Class.forName0(Native Method)
     10 at java.lang.Object.wait(Native Method)
     16 at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
     44 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    344 at sun.misc.Unsafe.park(Native Method)

The thread states also looked normal, so the next step was to analyze the heap dump file.

Heap dump files are binary data and hard to inspect on the command line, and the Java analysis tools are graphical and cannot be run on the Linux server, so the file had to be downloaded first.

Since we set the heap memory to 4 GB, the dumped heap file is quite large and takes a while to download, but we can compress it first.

gzip is a powerful compression command; in particular, you can use -1 through -9 to specify the compression level. The higher the number, the better the compression ratio and the longer it takes. I recommend -6 or -7: -9 is simply too slow for little extra gain, and in the time it spends compressing you could have downloaded the difference anyway.

MAT (Eclipse Memory Analyzer) is a great tool for analyzing Java heap memory. Open the heap file with it (rename the suffix to .hprof first) and it will ask what kind of report to generate; for this analysis, choose the Leak Suspects report.

As the pie chart showed, the vast majority of the heap memory was occupied by the same block of memory. Drilling into the heap details and tracing back up the reference chain, the culprit was quickly found.

From the heap analysis I identified the leaking object, searched the project globally for its class name, found that it is a Bean object, and located one of its properties of type Map.

The Map stores the result of each probe interface response in an ArrayList, keyed by type, so that it can be analyzed after each probe. Since the Bean object is never recycled and there is no logic that clears this property, over the ten days the service ran without a restart the Map grew bigger and bigger until it filled up the memory.
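
As a rough sketch of the pattern (class and field names are hypothetical, not the actual project code), the leak looks something like this: a long-lived bean keeps appending probe results to a map of lists that is never cleared.

// Hypothetical reconstruction of the leaking pattern; names are illustrative only.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProbeResultHolder {           // long-lived bean, never recycled
    private final Map<String, List<String>> resultsByType = new HashMap<>();

    public void record(String type, String responseBody) {
        // Every probe appends; nothing ever removes or clears entries,
        // so after ten days without a restart the map fills the 4 GB heap.
        resultsByType
            .computeIfAbsent(type, k -> new ArrayList<>())
            .add(responseBody);
    }

    // A minimal fix (one of several options): clear the buffered results
    // once they have been analyzed, or keep only a bounded window of them.
    public void clearAfterAnalysis(String type) {
        List<String> results = resultsByType.get(type);
        if (results != null) {
            results.clear();
        }
    }
}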

When memory is full, no more memory can be allocated for the HTTP response, so the thread gets stuck at readLine. And our interfaces with heavy I/O raised far more alarms, presumably because their large responses need more memory.

A PR was submitted to the code owner, and the problem was resolved satisfactorily.

Summary

In fact, I have to reflect on myself: the initial alarm emails also contained this thread stack:

groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:166)
groovy.json.internal.JsonParserCharArray.decodeJsonObject(JsonParserCharArray.java:132)
groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:186)
groovy.json.internal.JsonParserCharArray.decodeJsonObject(JsonParserCharArray.java:132)
groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:186)

When I saw this stack I did not think it through. TCP already guarantees message integrity, and the value is not assigned to the variable until the message has been fully received, so this was clearly an internal error. If I had examined it carefully I could have found the problem earlier; troubleshooting really does require that no link in the chain be overlooked.

Source | zhenbianshu.github.io

A case where frantic JVM GC caused the CPU to spike

Background

An online web server would become very sluggish from time to time. Logging in and running top showed very high CPU usage on the server; after restarting Tomcat the CPU went back to normal, and the same problem reappeared half a day or a day later. To solve it we had to catch the problem at its outbreak point, and an intermittent problem like this is very hard to pin down.

After restarting the server, all I could do was wait for the problem to happen again. My first suspicion was that some scheduled task was triggering heavy computation, or that some request was causing an infinite loop, so I first reviewed all the suspicious places in the code and added a bit of logging.

The strategy this time was to preserve the crime scene: there are two nodes online, so restart one of them to restore service, and take the other offline without restarting it, keeping the scene intact.

Troubleshooting

On the problem server, we first looked at the business logs. We did not find large numbers of repeated logs, so we initially ruled out an infinite loop, and then had to analyze the JVM.

Step 1: run the top command to find the PID that is occupying the CPU.

This is a screenshot taken after the fact; at the time the CPU had soared above 500% and the PID was 27683.

Then ps aux | grep 27683 to confirm whether it was our Tomcat occupying the CPU. It almost certainly was, because the CPU dropped immediately after Tomcat was restarted.

You can also use jps to list the Java PIDs.

Step 2: run top -H -p 27683 to list the threads of this process and find which thread ids are eating the CPU.

Step 3: use jstack to dump the thread information, with jstack 27683 >> aaaa.txt and then search the text, or search directly with a pipe: jstack 27683 | grep 6c23. The thread id, such as "6c23", appears in hexadecimal in the jstack output, so the decimal tid from top has to be converted; you can do that with printf "%x\n" tid, or just use a calculator.
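
If you prefer to stay in Java, a tiny throwaway helper does the same conversion as printf "%x\n" (the class name here is made up purely for illustration):

// Illustrative one-off helper, equivalent to `printf "%x\n" tid`:
// converts the decimal thread id shown by top -H into the hex form used in jstack output.
public class TidToHex {
    public static void main(String[] args) {
        long tid = Long.parseLong(args[0]);        // e.g. 27686
        System.out.println(Long.toHexString(tid)); // prints "6c26"
    }
}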

Unfortunately, I was led astray at first: when I searched for thread 6c26 I found it was doing GC. The CPU was high because of frantic GC, but I could not find what was producing so many objects, so I kept hunting for every possible infinite loop and memory leak.

Then I watched the GC activity with jstat -gcutil [pid] 1000 100 (sample every second, 100 samples).

This screenshot is from re-checking afterwards; the one from the time of the incident is no longer available.

I found that S0 kept filling with newly created objects and being collected, GC after GC. Looking at the stack information turned up nothing beyond String and Map objects; I could not pin down an infinite loop in the code or find the outbreak point of the problem, and I fell into complete confusion. After further checking confirmed it was not a memory leak, and with nothing to show for all the searching, I sank into contemplation.

The CPU was still far too high, so I kept running jstack 27683 and looking at the thread stacks, rather aimlessly, until I noticed something: the current node had been taken offline, meaning there was no user traffic at all, yet the CPU was still this high and thread stacks were still being printed constantly. So the threads that were still running were very likely the culprit. I immediately analyzed their stack traces and made a significant discovery.

The stacks were full of httpProxy_jsp. Was the server being attacked? I went to the code right away and found that this JSP really existed. The git commit log showed it had been submitted by a colleague a few days earlier, and the commit time matched the first appearance of the problem very well. Feeling that luck was on my side and the problem point had been found, I asked the colleague to walk through his code with me. The JSP itself is very simple: it acts as a proxy and makes an HTTP request to a back-end service.

The HttpRequestUtil it used was a utility he had written himself rather than one of the project's common utility classes, and in its post method the connection was given neither a connect timeout nor a read timeout.

If a connection's timeout is not set, or is set to 0, it is treated as infinite, meaning it will never time out. If the requested third-party page does not respond, or responds very slowly, the request just waits, and waits again on the next request, so those threads get stuck. The stuck threads pile up without crashing the process; they keep doing some work and producing large numbers of objects, which triggers the JVM into non-stop, frantic GC, drives the server CPU to its limit, and makes the server's responses extremely slow. The problem was finally found, and it fit the symptoms perfectly. Sure enough, after rewriting it with a 2-second timeout limit, the problem never occurred again.
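
The original utility code is not reproduced in the article, so the following is only a minimal sketch of the fix (class and method names are hypothetical); the essential point is the two timeout calls that were missing:

// Hypothetical sketch of the fixed utility; only the timeout handling reflects the article.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpRequestUtilSketch {
    public static String post(String endpoint, String body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setConnectTimeout(2_000); // missing in the original util: no timeout = threads pile up forever
        conn.setReadTimeout(2_000);    // likewise; 2 s matches the limit added in the fix
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line);
            }
            return response.toString();
        } finally {
            conn.disconnect();
        }
    }
}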

The process of solving this problem resulted in several observations:

1. JVM problems are complex, and the logs alone may not reveal the root cause; solving them is partly a matter of luck. Log analysis plus the business scenario plus some educated guessing is what gets you there. Don't charge down one road to the end; keep combining the current scenario with your hypotheses.

2. On getting to the root of the problem: with the CPU high, I kept thinking first of an infinite loop in the code, then of a memory leak, and so took many detours before finally finding the problem with the most naive method of all, reading the thread stack logs. There is no universal recipe; handle each specific problem on its own terms and don't be constrained by past experience.

3. When writing code, use the common utility classes the project already uses widely; try not to bring in your own untested utilities. Even a seemingly simple piece of code can bring disaster to a project, unless you understand its underlying behavior well enough, as with this timeout setting.

A case of unreasonable use of the synchronized keyword causing thread blocking under multi-threading

While doing performance diagnosis and tuning for a client, I ran into a case where improper use of the synchronized keyword caused threads to block in a multi-threaded application. This is a written record of the whole process of discovering, investigating, analyzing, and optimizing the problem. The investigation used XLand, our company's commercial performance analysis platform. Through this article I hope to share the analysis and optimization approach and the points to watch out for; readers who want to dig deeper are welcome to comment and discuss.

The phenomenon

During a single-interface load test, as the number of concurrent users grew from 10 to 50, TPS stayed flat while response time kept increasing. Application CPU was 27%, database CPU was 3%, and resource consumption stayed stable, so the bottleneck was likely inside the application.

Analysis

Thread analysis on the XLand platform showed that a certain thread was waiting on a lock. Further analysis in XLand showed that the getAccessor method of the AuthProvider class is marked with the synchronized keyword. When two or more threads call this synchronized method at the same time, only one thread can enter it at a time; the others must wait for the previous thread to finish the method before they can enter.

The risk point

The synchronized keyword addresses synchronizing access to a resource between multiple threads: it guarantees that at any moment only one thread can execute a method or code block it modifies. Use it with caution to avoid unnecessary thread blocking and added response time.

Optimization measures

After removing the synchronized keyword from the AuthProvider class, the TPS of the login-check interface rose from 174 to 624 transactions per second at 10 concurrent users, more than a threefold increase. Use synchronized sparingly in everyday programming; it is not needed unless multiple threads modify a static variable or a singleton's properties, and when it is necessary, prefer locking only the necessary block of code rather than the whole method, as sketched below.
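
The actual AuthProvider code is not shown in the article, so the following is only a schematic contrast (hypothetical field and method bodies) between locking the whole method and locking just the critical section, here using lazy initialization with double-checked locking:

// Schematic only; the real AuthProvider internals are not shown in the article.
public class AuthProvider {

    private volatile Accessor accessor;   // volatile is required for double-checked locking

    // Before: the whole method is synchronized, so every caller serializes here,
    // even when the accessor is already initialized.
    public synchronized Accessor getAccessorCoarse() {
        if (accessor == null) {
            accessor = new Accessor();
        }
        return accessor;
    }

    // After: lock only the critical section; the common fast path takes no lock at all.
    public Accessor getAccessorNarrow() {
        Accessor local = accessor;
        if (local == null) {
            synchronized (this) {
                local = accessor;
                if (local == null) {
                    accessor = local = new Accessor();
                }
            }
        }
        return local;
    }

    static class Accessor { }             // placeholder for whatever getAccessor returns
}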

Afterword

There are many possible sources of Java application performance bottlenecks: system factors such as disk, memory, and network I/O; the application code itself; JVM GC; the database; caching; and so on. Java performance optimization is generally divided into four layers: the application layer, the database layer, the framework layer, and the JVM layer. The difficulty rises with each layer, and the knowledge involved and the problems solved differ. But to be honest, most problems do not require you to understand framework source code, JVM parameters, or the GC mechanism; you only need to analyze the SQL, understand the code logic, locate the problematic Java code, and change it. After all, there is a saying that 80% of performance problems are caused by bad code, haha. A bit blunt, but maintaining good coding habits, being careful with keywords that can cause problems, and using memory resources sensibly will avoid a great many problems. Well, finally, I wish you all a thousand lines of code without a single bug!