Online faults mainly involve the CPU, disk, memory, and the network. Most faults span more than one of these areas, so it is worth checking all four in sequence.

Likewise, tools such as jstack and jmap are not limited to a single area. In practice, when a problem occurs, start with the df, free, and top trio, then bring in jstack and jmap, and analyze the specific problem from there.

CPU

We usually start with CPU issues, since CPU exceptions are often easier to locate. Causes include business logic problems (infinite loops), frequent GC, and excessive context switching.

The most common causes are business (or framework) logic problems, which can be analyzed with jstack by examining the corresponding thread stacks.

① Use jstack to analyze CPU problems

Use the ps command to find the PID of the corresponding process (if there are several candidate processes, use top first to see which one has the highest CPU usage).

Then use top -H -p pid to find the threads with high CPU usage:

Then convert the busiest thread ID to hexadecimal with printf '%x\n' tid to get the nid:

Then find the corresponding stack trace directly in the jstack output: jstack pid | grep 'nid' -C5 --color:

You can see that we have found the stack trace with nid 0x42, which we then just need to analyze carefully.
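
Putting the steps above together, a minimal sketch of the workflow might look like this (the process name, pid, tid, and nid are placeholders):

ps -ef | grep myapp                      # find the Java process PID (or use top)
top -H -p pid                            # list that process's threads, sorted by CPU usage
printf '%x\n' tid                        # convert the busiest thread ID to hex -> nid
jstack pid | grep 'nid' -C5 --color      # replace 'nid' with the hex value from the previous step, e.g. 0x42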

Of course, it is more common to analyze the entire jstack output. We usually pay most attention to the WAITING and TIMED_WAITING sections, and BLOCKED goes without saying.

We can use cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c to get an overall picture of the thread states in the dump; if WAITING and the like are unusually numerous, there is probably something wrong.

② Frequent GC

Of course we would still use jstack to analyze the problem, but sometimes we can first confirm whether GC is too frequent.

Run jstat -gc pid 1000 to observe the changes in the GC generations; 1000 is the sampling interval (ms). S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU represent the capacity and usage of the two Survivor spaces, the Eden space, the old generation, and the metaspace respectively.

YGC/YGCT, FGC/FGCT, and GCT represent the young GC count and time, the Full GC count and time, and the total GC time.

If you see GC frequently, refer to the GC section for further analysis.

③ Context switch

For frequent context switch problems, we can use the vmstat command to check:

The cs column shows the number of context switches. To monitor a particular pid, use pidstat -w pid; cswch and nvcswch are voluntary and involuntary switches respectively.
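
For example, a quick sketch (the interval and count are illustrative):

vmstat 1 5               # sample every second, five times; watch the cs column
pidstat -w -p pid 1 5    # per-process context switches: cswch/s (voluntary), nvcswch/s (involuntary)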

disk

Disk issues are as fundamental as CPU issues. First, for disk space, we use df -hl directly to check the file system status:

More often than not, disk issues are performance issues. We can analyze them with iostat: iostat -d -k -x:

The last column, %util, shows how heavily each disk is being used, while rrqm/s and wrqm/s indicate the read and write rates respectively, which generally helps locate the specific disk with the problem.

In addition, we also need to know which process is doing the reading and writing. Developers usually have a rough idea themselves, or you can use the iotop command to locate the source of the file I/O.

However, iotop gives a tid, which needs to be converted to a pid. We can find the pid with readlink: readlink -f /proc/*/task/tid/../..

Once you have the pid, you can check the process's specific read/write statistics: cat /proc/pid/io

We can also use the lsof command to determine which files a specific process is reading and writing: lsof -p pid:
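
As a consolidated sketch of the disk checks above (pid and tid are placeholders):

df -hl                                  # file system usage
iostat -d -k -x 1 5                     # per-device stats; watch %util, rrqm/s, wrqm/s
iotop                                   # per-process I/O (typically needs root)
readlink -f /proc/*/task/tid/../..      # map a tid reported by iotop back to its pid
cat /proc/pid/io                        # cumulative read/write bytes for that process
lsof -p pid                             # files the process currently has open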

memory

Troubleshooting memory problems is more troublesome than CPU problems, and there are more scenarios: OOM, GC issues, and off-heap memory, among others.

Generally speaking, we first use the free command to get a first-round overview of memory:

Heap memory

Memory problems are mostly heap memory problems. Outwardly, they mainly show up as OOM and StackOverflow.

① OOM

OOM due to insufficient JVM memory can be divided into the following categories:

Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread

This means there is not enough memory to allocate a Java stack for a new thread. It is usually a problem in thread pool code, such as forgetting to shut down a pool, so look at the code level first, using jstack or jmap.

If the code looks fine, you can reduce the size of each thread stack on the JVM side by specifying a smaller Xss.

You can also work at the system level and raise the OS limits on threads by modifying nofile and nproc in /etc/security/limits.conf.
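
A rough sketch of the system-level check and change (the user name and values are illustrative):

ulimit -u                        # current max user processes; threads count against this limit
# In /etc/security/limits.conf, raise the limits for the service user, e.g.:
#   appuser  soft  nproc   65535
#   appuser  hard  nproc   65535
#   appuser  soft  nofile  65535
#   appuser  hard  nofile  65535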

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

This means that heap usage has reached the maximum set by -Xmx, and it is probably the most common OOM error.

The solution is still to look in the code first: suspect a memory leak and use jstack and jmap to locate the problem. If everything looks normal, expand the heap by increasing the value of Xmx.

Caused by: java.lang.OutOfMemoryError: Metaspace

This means the metaspace is out of memory. The approach is the same as above; the parameter to adjust is -XX:MaxMetaspaceSize (for the old permanent generation it was -XX:MaxPermSize).
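
A minimal sketch of the sizing flags, assuming JDK 8 or later and purely illustrative values (app.jar is a placeholder):

java -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=512m -jar app.jar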

② StackOverflow

Stack memory overflow, which you see a lot.

Exception in thread "main" java.lang.StackOverflowError

This indicates that the memory required by the thread stack exceeds the Xss value. Again, check the code first; the parameter can be adjusted via Xss, but setting it too large may itself cause OOM.

③ Use jmap to locate code-level memory leaks

We dump the heap with jmap: jmap -dump:format=b,file=filename pid

Then import the dump file into Eclipse Memory Analyzer (MAT) for analysis. Generally we choose Leak Suspects directly, and MAT gives suggestions about the memory leak.

Alternatively, select Top Consumers to view the largest-object report, or Thread Overview to analyze thread-related problems.

In addition, you can select the Histogram class overview and do your own analysis, following MAT's tutorial.

In everyday development, memory leaks in code are common and subtle, requiring more attention to detail.

For example: creating new objects on every request, resulting in massive amounts of repeated object creation; opening file streams but not closing them properly; triggering GC manually and improperly; or allocating ByteBuffers unreasonably. All of these can make the code OOM.

On the other hand, we can specify -XX:+HeapDumpOnOutOfMemoryError in the startup parameters to save a dump file when OOM occurs.
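
An illustrative set of startup flags (the dump path and jar name are assumptions, not from the original):

java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof -jar app.jar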

④ GC issues and threads

GC problems affect not only the CPU but also memory, and the troubleshooting approach is the same: use jstat to check the generational changes, for example whether the youngGC or fullGC counts are too high, and whether indicators such as EU and OU are growing abnormally.

As for threads, too many of them that are not reclaimed in time can also trigger OOM, mostly the unable to create new native thread error described earlier.

Besides carefully analyzing the jstack dump file, we usually look at the overall thread count first, via pstree -p pid | wc -l.

Or you can view the number of threads in /proc/pid/task.

Off-heap memory

If you run into an off-heap memory overflow, that is really unfortunate. The first symptom of an off-heap overflow is that physical resident memory grows quickly; whether an error is reported depends on how the memory is used.

If it is caused by using Netty, an OutOfDirectMemoryError may appear in the error log; if DirectByteBuffer is used directly, the error is OutOfMemoryError: Direct buffer memory.

Off-heap memory leaks are often associated with the use of NIO. We usually use pmap first to view the process's memory layout: pmap -x pid | sort -rn -k3 | head -30, which lists the top 30 memory segments of the pid in descending order of size.

You can run the command again after some time to see how memory has grown, or compare against a normal machine to spot suspicious memory segments.

If you identify a suspicious memory segment, analyze it with gdb: gdb --batch --pid {pid} -ex "dump memory filename.dump {start memory address} {start memory address + block size}"

After getting the dump file, you can view it with hexdump: hexdump -C filename | less, though most of it will be binary gibberish.
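
A consolidated sketch of that workflow (pid, the addresses, and the file names are placeholders):

pmap -x pid | sort -rn -k3 | head -30                                  # top 30 memory segments by RSS
gdb --batch --pid pid -ex "dump memory /tmp/mem.dump start_addr end_addr"   # addresses come from the pmap output
hexdump -C /tmp/mem.dump | less                                        # inspect the dumped segment (mostly binary)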

NMT (Native Memory Tracking) is a HotSpot feature introduced in Java 7u40. Combined with the jcmd command, it lets us see the breakdown of native memory.

You need to add -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail to the startup parameters, which causes a slight performance loss.

First record a baseline: jcmd pid VM.native_memory baseline.

Then wait a while for memory to grow, and do a summary- or detail-level diff against the baseline with jcmd pid VM.native_memory detail.diff (or summary.diff).
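
A minimal sketch of the NMT workflow (the jar name and timing are illustrative; tracking must be enabled at startup):

java -XX:NativeMemoryTracking=detail -jar app.jar     # enable tracking when the process starts
jcmd pid VM.native_memory baseline                    # record a baseline
# ... wait for the suspected leak to grow ...
jcmd pid VM.native_memory detail.diff                 # or summary.diff, compared against the baseline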

As you can see, jcmd's memory analysis is very detailed, covering the heap, threads, and GC (so the other memory exceptions mentioned above can also be analyzed with NMT). For off-heap issues, we focus on the growth of Internal memory; if the growth is too obvious, there is a problem.

At the detail level, you can also see the growth of specific memory segments, as shown in the following figure:

In addition, strace -f -e "brk,mmap,munmap" -p pid can be used at the system level to monitor memory allocation.

The memory allocation information mainly includes PID and memory address:

The key is still to look at the error log and stack, find the suspicious object, understand its collection mechanism, and then analyze that object.

For example, memory allocated by DirectByteBuffer needs a Full GC or a manual System.gc() to be reclaimed (so it is best not to use -XX:+DisableExplicitGC).

We can track the memory of DirectByteBuffer objects and manually trigger a Full GC with jmap -histo:live pid to see whether the off-heap memory is collected.

If it is collected, there is a high probability that the off-heap memory allocation itself is too small; adjust it with -XX:MaxDirectMemorySize.

If nothing changes, use jmap to analyze the objects that cannot be GCed and their references to DirectByteBuffer.
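
A quick sketch of that check (the size value and jar name are illustrative; -histo:live triggers a Full GC as a side effect):

jmap -histo:live pid | grep DirectByteBuffer          # count live DirectByteBuffer instances after the forced GC
java -XX:MaxDirectMemorySize=512m -jar app.jar        # raise the direct memory cap if the allocation is simply too small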

GC issues

In-heap memory leaks are always accompanied by GC anomalies. GC issues, however, are not only about memory; they can also cause CPU load, network problems, and other complications. Since they are most closely tied to memory, we summarize GC issues separately here.

In the CPU chapter, we introduced the use of jstat to obtain the current GC generational change information.

Use -verbose:gc, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, and -XX:+PrintGCTimeStamps to enable GC logging.
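
An illustrative JDK 8-style set of flags (the log path and jar name are assumptions; JDK 9+ replaces these flags with -Xlog:gc*):

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc.log -jar app.jar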

The meaning of common Young GC and Full GC log lines will not be covered here. With GC logs, we can roughly infer whether youngGC and fullGC are too frequent or take too long, and tune accordingly.

Using the G1 collector, -XX:+UseG1GC, is also recommended.

① youngGC is too frequent

If youngGC is frequent, there are many small, short-lived objects. First consider whether the Eden/young generation is set too small, and see whether the problem can be solved by adjusting parameters such as -Xmn and -XX:SurvivorRatio.

If the parameters are fine but the youngGC frequency is still too high, use jmap and MAT to examine the dump file further.

② youngGC takes too long

For long pauses, look at where the time is consumed in the GC log. Taking G1 logs as an example, pay attention to the Root Scanning, Object Copy, and Ref Proc phases.

If Ref Proc takes long, pay attention to which objects use reference types; if Root Scanning takes long, pay attention to the number of threads and to cross-generation references.

If Object Copy takes long, pay attention to object lifetimes. Timing analysis also requires horizontal comparison, that is, against other projects or normal time periods.

For example, in the figure, Root Scanning increases markedly compared with the normal time, indicating too many threads.

③ Full GC is triggered

What G1 runs more often is mixed GC, but mixed GC can be investigated in the same way as youngGC.

When a Full GC is actually triggered, it usually means a problem: G1 degenerates to the Serial collector to do the cleanup, with pauses measured in seconds.

Reasons for FullGC may include the following, as well as some ideas for tuning parameters:

  • Concurrent mode failure: during the concurrent marking phase, the old generation fills up before the mixed GC starts, at which point G1 abandons the marking cycle.

    In this case, it may be necessary to increase the heap size or adjust the number of concurrent marking threads with -XX:ConcGCThreads.

  • Promotion failure: at GC time there was not enough memory for surviving/promoted objects, so a Full GC was triggered.

    In this case, you can increase the percentage of reserved memory with -XX:G1ReservePercent, lower -XX:InitiatingHeapOccupancyPercent to start marking earlier, or increase the number of marking threads with -XX:ConcGCThreads.

  • Humongous object allocation failure: a large object cannot find suitable region space to be allocated in, so a Full GC is performed. In this case, you can increase the memory or increase -XX:G1HeapRegionSize.

  • The program actively calls System.gc(): don't write this casually.

In addition, we can configure -XX:HeapDumpPath=/xxx/dump.hprof in the startup parameters to dump fullGC-related files, and use jinfo to dump before and after a Full GC:

jinfo -flag +HeapDumpBeforeFullGC pid 
jinfo -flag +HeapDumpAfterFullGC pid

This produces two dump files. Comparing them, and focusing on the problem objects dropped by the GC, usually locates the issue.

network

Network-level problems are generally the most complex: there are many scenarios and they are hard to locate, making them the nightmare of most developers.

Here we give some examples, covering the TCP layer, the application layer, and tool usage.

① Timeouts

Most timeout errors are at the application level, so this part focuses on the concepts. Timeouts can be broadly divided into connect timeouts and read/write timeouts; client frameworks that use connection pooling also have connection-acquisition timeouts and idle-connection eviction timeouts.

Read/write timeout: readTimeout/writeTimeout. Some frameworks call it so_timeout or socketTimeout; both refer to data read/write timeouts.

Note that most of the timeouts here are logical timeouts. SOA timeouts are also read timeouts. Read/write timeouts are usually set only on the client side.

Connect timeout: connectionTimeout, the maximum time for a client to establish a connection with a server.

On the server side, connectionTimeout means something different: in Jetty it is the idle connection cleanup time, and in Tomcat it is the maximum time a connection may be held.

Others include the connection-acquisition timeout connectionAcquireTimeout and the idle-connection eviction timeout idleConnectionTimeout, which appear in client or server frameworks that use connection pools or queues.

When setting the various timeouts, we need to ensure that the client timeout is smaller than the server timeout so that the connection can terminate normally.

In real development, what we care about most is read/write timeouts on interfaces. Setting a reasonable interface timeout is a problem in itself.

If an interface's timeout is set too long, it may hold too many TCP connections on the server; if it is set too short, the interface will time out very frequently.

Another problem is when the client keeps timing out even though the server interface's response time has been brought down. This one is simple: the path between client and server includes network transmission, queuing, and service processing, and each step can consume time.

② TCP queue overflow

TCP queue overflow is a relatively low-level error that can cause more surface-level errors such as timeouts and RSTs. Because the error is more hidden, we discuss it separately.

As shown in the figure above, there are two queues:

  • syns queue (half-connection queue)
  • accept queue (full-connection queue)

In the three-way handshake, after the server receives the client's SYN, it puts the entry into the syns queue and replies with SYN+ACK to the client. The server then receives the client's ACK.

If the accept queue is not full at that point, the entry is moved from the syns queue to the accept queue; otherwise the behavior follows tcp_abort_on_overflow.

tcp_abort_on_overflow = 0: if the accept queue is full at step 3 of the three-way handshake, the server discards the ACK sent by the client.

tcp_abort_on_overflow = 1: at step 3, if the full-connection queue is full, the server sends an RST packet to the client, abandoning both the handshake and the connection. This can show up as many connection reset / connection reset by peer errors in the log.

So in real development, how can we quickly locate TCP queue overflow?

With the netstat command: run netstat -s | egrep "listen|LISTEN":

As shown in the figure above, overflowed indicates the number of full-connection queue overflows, and sockets dropped indicates the number of half-connection queue overflows.

With the ss command: run ss -lnt:

Here Send-Q indicates that the maximum full-connection queue for the listen port in column three is 5, and Recv-Q indicates how much of the full-connection queue is currently in use.

Now let's look at how the full-connection queue size is set: it depends on min(backlog, somaxconn).

backlog is passed in when the socket is created, and somaxconn is an OS-level kernel parameter. The half-connection queue size depends on max(64, /proc/sys/net/ipv4/tcp_max_syn_backlog).

In daily development, we often use Servlet containers as servers, so we sometimes need to pay attention to the size of the container’s connection queue.

In Tomcat the backlog is called acceptCount; in Jetty it is acceptQueueSize.
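
A quick sketch of checking and adjusting these values (the somaxconn value is illustrative):

cat /proc/sys/net/core/somaxconn                 # OS cap on the accept (full-connection) queue
cat /proc/sys/net/ipv4/tcp_max_syn_backlog       # feeds into the half-connection queue size
cat /proc/sys/net/ipv4/tcp_abort_on_overflow     # 0 = drop the client's ACK, 1 = reply with RST
sysctl -w net.core.somaxconn=1024                # raise the OS cap (also raise the application backlog)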

③ Abnormal RST

An RST packet signifies a connection reset and is used to close unwanted connections; it usually indicates an abnormal close, as opposed to the normal four-way FIN close.

In real development, the connection reset / connection reset by peer errors we often see are the result of RST packets.

Nonexistent port: if a SYN request to establish a connection is sent to a port that does not exist, the server finds it has no such port and directly returns an RST packet to break the connection.

Actively terminating a connection in place of FIN: normally a connection is closed with a FIN packet, but an RST packet can be used to terminate the connection instead.

In practice this can be controlled by setting SO_LINGER; it is usually deliberate, skipping TIME_WAIT to improve interaction efficiency, but it should be used with caution.

When an exception occurs on either the client or the server side, that side sends an RST to the peer to close the connection: the RST packet sent on TCP queue overflow described above falls into this category.

This is usually because one party can no longer process the connection properly for some reason (e.g. the application crashes or the queue is full) and tells the other party to close the connection.

A received TCP packet does not belong to any known TCP connection: for example, one side loses a TCP packet because of a bad network and the other side closes the connection. Much later, the missing TCP packet finally arrives, but the corresponding TCP connection no longer exists, so the receiving side simply sends an RST packet so that a new connection can be opened.

One party has not received an acknowledgement from the other for a long time, and sends an RST packet after a certain amount of time or number of retransmissions.

This is mostly related to the network environment. Poor network environment may result in more RST packets.

As mentioned earlier, too many RST packets will cause the program to report an error. If a read operation is performed on a closed connection, connection reset will be reported, while if a write operation is performed on a closed connection, Connection reset by peer will be reported.

You may also see the broken pipe error. This is a pipe-level error meaning a read or write on a closed pipe; it often occurs when the program keeps reading or writing data after already receiving a connection reset error, as described in the glibc source comments.

How do we know whether RST packets are present during troubleshooting? Naturally, by capturing packets with tcpdump and analyzing them with Wireshark.

tcpdump -i en0 tcp -w xxx.cap, where en0 is the network interface to listen on:

Next, open the capture with Wireshark, as shown in the figure below; the red entries are RST packets.
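
A small sketch of the capture (en0 and the file name come from the example above; the RST filter line is an extra illustration):

tcpdump -i en0 tcp -w xxx.cap                    # capture all TCP traffic on en0 for later analysis in Wireshark
tcpdump -i en0 'tcp[tcpflags] & tcp-rst != 0'    # or filter RST packets directly on the command line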

④ TIME_WAIT and CLOSE_WAIT

We all know what TIME_WAIT and CLOSE_WAIT mean.

Online, we can directly view the counts of time_wait and close_wait with: netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'.

Using the ss command is faster: ss -ant | awk '{++S[$1]} END {for(a in S) print a, S[a]}':

TIME_WAIT: time_wait exists so that delayed packets from an old connection are not picked up by a later connection, and so that the connection can be closed properly within the 2MSL window.

Its presence actually reduces the number of RST packets. Excessive time_wait tends to occur in scenarios with frequent short connections.

In this case, some kernel parameters can be tuned on the server side:

# Enable reuse: allow TIME_WAIT sockets to be reused for new TCP connections. Default 0 (disabled).
net.ipv4.tcp_tw_reuse = 1
# Enable fast recycling of TIME_WAIT sockets in TCP connections. Default 0 (disabled).
net.ipv4.tcp_tw_recycle = 1

Remember the tcp_max_tw_buckets parameter as well: time_wait connections beyond this number are dropped, and when it is exceeded the log reports "time wait bucket table overflow".

CLOSE_WAIT: close_wait usually occurs because the application fails to send its own FIN packet after ACKing the peer's close.

The probability of close_wait occurring is even higher than time_wait, and the consequences are more serious. It usually means something is blocked and the connection is not closed properly, which gradually consumes all the threads.

To locate this kind of problem, it is best to use jstack to analyze the thread stacks, as described in the sections above. Here is just one example.

A developer reported that CLOSE_WAIT kept increasing after the application went live, until the service died. jstack revealed a suspicious stack: most threads were stuck in the countDownLatch.await method.

After talking with the developer, I learned that multithreading was used but exceptions were not caught. After fixing that, it turned out the exception was simply the class not found error that often appears after upgrading an SDK.

PS: If you found this share useful, feel free to like and share it.

Welcome to leave a message ~~

Source: fredal.xin/java-error-check