Online faults mainly involve CPU, disk, memory, and network issues. Most faults span more than one of these levels, so it is best to check all four in sequence. Tools such as jstack and jmap are not tied to a single class of problem either; in practice the routine is usually df, free and top first, then jstack and jmap against the suspect service, after which the specific problem can be analyzed.

CPU

We usually start with CPU issues, since CPU exceptions are often easier to locate. Causes include business logic problems (endless loops), frequent GC, and excessive context switching. The most common causes are in business (or framework) logic, and jstack can be used to analyze the corresponding thread stacks.

Analyze CPU problems using jstack

Use the ps command to find the PID of the corresponding process (if there are several candidate processes, use top first to see which one has the highest CPU usage). Then use top -H -p pid to find the threads with high CPU utilization.


Then convert the thread ID with the highest usage to hexadecimal with printf '%x\n' tid to get the nid, and look it up directly in the jstack output: jstack pid | grep 'nid' -C5 --color. In the example we find the stack with nid 0x42, which we can then analyze carefully.
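
Putting the steps together, a minimal shell sketch might look like this (the process name my-app, pid 12345 and tid 6789 are placeholders, not values from the example above):

pgrep -f my-app                                # find the PID of the Java process
top -H -p 12345                                # note the TID of the hottest thread, then quit with q
printf '%x\n' 6789                             # convert that TID to hex to get the nid, e.g. 1a85
jstack 12345 | grep 'nid=0x1a85' -C5 --color   # locate the matching thread stack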

Of course, it is more common to analyze the whole jstack output, and we usually focus on the WAITING and TIMED_WAITING states, not to mention BLOCKED. We can use the command cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c to get an overall sense of the thread states; if there are too many threads in WAITING or similar states, there is probably a problem.

Frequent GC

Of course we still use jstack to analyze the problem, but sometimes we can first confirm whether GC is too frequent: use jstat -gc pid 1000 to observe generational changes, where 1000 is the sampling interval in milliseconds. S0C/S1C, S0U/S1U, EC/EU, OC/OU and MC/MU represent the capacity and usage of the two Survivor regions, the Eden region, the old generation and the metaspace respectively, while YGC/YGCT, FGC/FGCT and GCT represent the count and time of young GC, the count and time of full GC, and the total GC time. If you see that GC is indeed frequent, refer to the GC section for further analysis.
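
For example, the following sketch samples GC statistics once per second for ten samples (pid 12345 is a placeholder):

jstat -gc 12345 1000 10
# Watch whether EU/OU keep climbing between samples and whether the YGC/FGC counters grow quickly.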

Context switch

For frequent context switching, we can use the vmstat command to take a look; the cs column shows the number of context switches. If we want to monitor a particular pid, we can use pidstat -w pid, where cswch and nvcswch represent voluntary and involuntary context switches respectively.
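
A minimal sketch, with pid 12345 as a placeholder:

vmstat 1 5                # system-wide view; the cs column is context switches per second
pidstat -w -p 12345 1 5   # per-process view; cswch/s = voluntary, nvcswch/s = involuntary switches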

disk

Disk issues are as fundamental as CPU issues. First comes disk space: we can check the file system status directly with df -hl.

More often than not, disk issues are performance issues. We can analyze them with iostat -d -k -x. The last column, %util, shows how busy each disk is, while rrqm/s and wrqm/s reflect the read and write activity respectively, which helps locate the faulty disk.

In addition, we also need to know which process is doing the reading and writing. Generally developers have a rough idea already, or you can use the iotop command to locate the source of the file I/O. What iotop shows, however, is a tid, which we need to convert to a pid; we can find the pid via readlink -f /proc/*/task/tid/../.. and, once you have the pid, see the process's read/write statistics with cat /proc/pid/io. You can also use the lsof command to determine which specific files it is reading and writing: lsof -p pid.
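
As a rough sketch of that sequence (tid 6789 and pid 12345 are placeholders):

iotop -o                              # show only threads actually doing I/O and note the hot TID
readlink -f /proc/*/task/6789/../..   # resolve that TID to its owning /proc/<pid> directory
cat /proc/12345/io                    # cumulative read/write byte counters for the process
lsof -p 12345                         # which files the process currently has open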

memory

Troubleshooting memory problems is more troublesome than CPU and covers more scenarios, including OOM, GC issues, and off-heap memory. Normally we first use the free command to get a picture of the overall memory situation.

In-heap memory

Memory problems are mostly heap memory problems, mainly divided into OOM and StackOverflow.

OOM

JVM memory is insufficient. OOM can be divided into the following categories:

Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread means there is not enough memory to allocate a stack for a new thread. It is basically caused by the way thread pool code is written, such as forgetting to shut down a pool, so look at the code level first, using jstack or jmap. If the code is fine, you can reduce the size of a single thread stack on the JVM side by specifying -Xss, or modify nofile and nproc in /etc/security/limits.conf at the system level to raise the OS thread limits.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space means heap usage has reached the upper limit set by -Xmx; this should be one of the most common OOM errors. The solution is still to look in the code first, suspect a memory leak, and use jstack and jmap to locate the problem. If everything looks fine, the heap needs to be expanded by increasing the value of -Xmx.

Caused by: java.lang.OutOfMemoryError: Metaspace means metaspace usage has reached the limit set by -XX:MaxMetaspaceSize. The troubleshooting approach is the same as above; on the parameter side it can be adjusted via -XX:MaxMetaspaceSize (or -XX:MaxPermSize for the permanent generation before Java 8).

Stack Overflow

Stack memory overflow, which you see a lot. Exception in thread "main" java.lang.StackOverflowError means the memory needed by a thread's stack exceeds the value of -Xss. Check the code first; on the parameter side it can be adjusted through -Xss, but setting it too large may lead to OOM.

Use jmap to locate code memory leaks

For the OOM and StackOverflow code-level checks above, we use jmap -dump:format=b,file=filename pid to export a dump file, then import it into Eclipse Memory Analyzer (MAT) for analysis. Generally we choose Leak Suspects directly, and MAT gives memory-leak suggestions. Alternatively, you can select Top Consumers to view the largest-object report, or Thread Overview to analyze thread-related problems. Otherwise, select the Histogram class overview and do your own analysis, as described in MAT's tutorial.
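
A hedged example that writes the heap dump to /tmp for later analysis in MAT (pid 12345 and the file path are placeholders):

jmap -dump:format=b,file=/tmp/heap.hprof 12345
# Adding the live option (jmap -dump:live,format=b,...) dumps only reachable objects, but it triggers a full GC first.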

In everyday development, memory leaks in code are common and subtle, requiring attention to detail: for example, creating new objects on every request and thus producing large numbers of duplicate objects; performing file stream operations without closing them properly; triggering GC manually and improperly; or allocating ByteBuffers unreasonably. All of these can lead to OOM.

On the other hand, we can specify -XX:+HeapDumpOnOutOfMemoryError in the launch parameters to save a dump file when OOM occurs.

GC issues and threads

GC problems affect not only CPU but also memory, and the troubleshooting approach is the same: use jstat to check generational changes, such as whether the youngGC or fullGC counts and the growth of EU, OU and other metrics are abnormal. Too many threads that are not GC'd in time can also trigger OOM, mostly the unable to create new native thread error mentioned earlier. Besides analyzing the dump file in detail with jstack, we usually first look at the overall thread count with pstree -p pid | wc -l, or simply check the number of entries in /proc/pid/task (each entry is a thread).
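
Two quick ways to get that count (pid 12345 is a placeholder):

pstree -p 12345 | wc -l       # count the threads in the process tree
ls /proc/12345/task | wc -l   # each entry under task/ is one thread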

Off-heap memory

If you run into an off-heap memory overflow, that is unfortunate. If it is caused by the use of Netty, the error log may contain an OutOfDirectMemoryError; if DirectByteBuffer is used directly, it will report OutOfMemoryError: Direct buffer memory.

Off-heap memory overflow is usually associated with NIO usage. We generally first use pmap to check the memory usage of the process: pmap -x pid | sort -rn -k3 | head -30 shows the top 30 memory segments of the pid in descending order of size. You can run the command again after some time to see how memory grows, or compare against a normal machine to spot suspicious memory segments. If we find a suspicious memory segment, we need to analyze it through gdb: gdb --batch --pid {pid} -ex "dump memory filename.dump {memory start address} {memory start address + memory block size}". The dump file can then be examined with hexdump -C filename | less, though most of what we see is binary gibberish.
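
A hedged sketch of the whole sequence; the pid, start address and length below are purely illustrative:

pmap -x 12345 | sort -rn -k3 | head -30      # top 30 mappings by resident size
# Suppose a suspicious anonymous mapping starts at 0x7f0000000000 and is 16 MB (0x1000000 bytes) long:
gdb --batch --pid 12345 -ex "dump memory /tmp/seg.dump 0x7f0000000000 0x7f0001000000"
hexdump -C /tmp/seg.dump | less              # mostly binary noise, but readable strings may hint at the owner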

NMT is a HotSpot feature introduced in Java 7u40. Together with the jcmd command, it lets us see the detailed memory composition. You need to add -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail to the startup parameters, which causes a slight performance loss.

In general, if out-of-heap memory grows slowly until it blows up, you can set a baseline with jcmd pid VM.native_memory baseline, wait for a while to let the memory grow, and then run jcmd pid VM.native_memory detail.diff (or summary.diff) to get a diff at the detail or summary level. As you can see, the memory analysis from jcmd is very detailed, covering heap, threads and GC (so the other memory exceptions mentioned above can also be analyzed with NMT). Here we focus on the growth of Internal memory; if it grows too obviously, there is definitely a problem. At the detail level you can also see the growth of specific memory segments, as shown in the figure below.
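
A minimal sketch of that baseline/diff workflow, assuming the JVM was started with -XX:NativeMemoryTracking=detail and 12345 is a placeholder pid:

jcmd 12345 VM.native_memory baseline      # record the current native memory usage as the baseline
sleep 3600                                # let the suspected growth accumulate
jcmd 12345 VM.native_memory detail.diff   # show per-category growth since the baseline (or summary.diff)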

At the system level, we can also use the strace command to monitor memory allocation: strace -f -e "brk,mmap,munmap" -p pid. The traced allocation information mainly includes the pid and the memory address.

The key is still to look at the error log and stack, find the suspicious object, understand its recycling mechanism, and then analyze that object. For example, the memory allocated by DirectByteBuffer is only reclaimed on full GC or an explicit System.gc() (so it is best not to use -XX:+DisableExplicitGC). We can trigger a full GC manually with jmap -histo:live pid and see whether the off-heap memory has been collected. If it has been collected, there is a high probability that the off-heap memory itself was allocated too small, and it can be adjusted with -XX:MaxDirectMemorySize. If nothing changes, it is time to use jmap to analyze the objects that cannot be GC'd and their references to DirectByteBuffer.

GC issues

Heap memory leaks are always accompanied by GC anomalies. GC issues, however, are not only about memory; they can also cause CPU load spikes, network problems and other complications. Since they are most closely tied to memory, we summarize GC issues separately here.

In the CPU chapter we already described using jstat to get the current GC generational change information. GC logging can be enabled by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps to the startup parameters. The meanings of common young GC and full GC log lines will not be described here.

By looking at the GC logs, we can roughly infer whether youngGC and fullGC are too frequent or too time-consuming, and then prescribe the right remedy. The analysis below takes G1 as an example, and using G1 (-XX:+UseG1GC) is also recommended.

youngGC too frequent: frequent youngGC usually means there are many small, short-lived objects. First consider whether the Eden region/young generation is set too small and whether the problem can be solved by adjusting the generation sizes. If the parameters are fine but the young GC frequency is still too high, use jmap and MAT to further examine a dump file.

youngGC takes too long: where the time goes depends on the GC log. Taking G1 logs as an example, pay attention to the Root Scanning, Object Copy and Ref Proc phases: if Ref Proc takes long, be careful about how reference-type objects are used; if Root Scanning takes long, pay attention to the number of threads and cross-generation references; Object Copy relates to object lifetimes. Timing analysis also needs horizontal comparison, that is, against other projects or normal time periods. For example, in the figure, Root Scanning grows much more than in normal periods, which points to too many threads.

fullGC triggered: in G1 it is more often mixedGC that runs, and mixedGC can be investigated with the same approach as youngGC. When a real fullGC is triggered, there is usually a problem: G1 degenerates to the Serial collector to clean up garbage, and pauses reach the level of seconds. Reasons for fullGC may include the following, together with some ideas for parameter tuning (a combined startup-flag sketch follows the list):

  • Concurrent mode failure: during the concurrent marking phase, the old generation fills up before the mixedGC happens, and G1 abandons the marking cycle. In this case, you may need to increase the heap size or adjust the number of concurrent marking threads with -XX:ConcGCThreads.
  • Promotion failure: at GC time there was not enough memory for the surviving/promoted objects, so a full GC was triggered. Here you can increase the percentage of reserved memory with -XX:G1ReservePercent, lower -XX:InitiatingHeapOccupancyPercent to start marking earlier, or increase the number of marking threads with -XX:ConcGCThreads.
  • Large object allocation failure: if a large object cannot find suitable region space to be allocated in, a fullGC is performed. In this case you can increase memory or increase -XX:G1HeapRegionSize.
  • The program explicitly calls System.gc(): don't write it casually.
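
A hedged example of how these knobs might be combined in the startup parameters; every value here is illustrative, not a recommendation:

java -XX:+UseG1GC \
     -Xms4g -Xmx4g \
     -XX:G1HeapRegionSize=8m \
     -XX:InitiatingHeapOccupancyPercent=35 \
     -XX:ConcGCThreads=4 \
     -XX:G1ReservePercent=15 \
     -jar app.jar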

In addition, we can configure -XX:HeapDumpPath=/xxx/dump.hprof to set where fullGC-related dump files go, and use jinfo to enable dumps before and after fullGC:

jinfo -flag +HeapDumpBeforeFullGC pid 
jinfo -flag +HeapDumpAfterFullGC pid

This gives two dump files. After comparing them, focus on the problem objects dropped by GC in order to locate the issue.

network

Network-level problems are generally more complex, involve more scenarios, and are harder to locate; they are the nightmare of most developers and are probably the most complicated category. Some examples are given here, covering the TCP layer, the application layer, and the use of tools.

timeout

Timeout errors are mostly at the application level, so this part focuses on the concepts. Timeouts can be broadly divided into connection timeouts and read/write timeouts; client frameworks that use connection pooling also have connection-acquisition timeouts and idle-connection eviction timeouts.

  • Read/write timeout: readTimeout/writeTimeout, which some frameworks call so_timeout or socketTimeout, refers to data read and write timeouts. Note that most of the timeouts here are logical timeouts. SOA timeouts also refer to read timeouts. Read/write timeouts are usually set only on the client side.
  • Connection timeout: connectionTimeout is, on the client side, the maximum time to establish a connection with a server. On the server side, connectionTimeout means something different: in Jetty it is the idle-connection cleanup time, while in Tomcat it is the maximum duration a connection is kept.
  • Others: including the connection-acquisition timeout connectionAcquireTimeout and the idle-connection eviction timeout idleConnectionTimeout, used by client or server frameworks that rely on connection pools or queues.

In setting various timeout periods, we need to ensure that the client timeout is smaller than the server timeout to ensure the normal termination of the connection.

In real development, what we care about most is the read/write timeout of an interface.

How to set a reasonable interface timeout is a problem. If the timeout setting of the interface is too long, it may occupy too many TCP connections on the server. If the interface is set too short, it will time out very frequently.

Another problem is that the client keeps timing out even though the server interface's response time has come down. The explanation is simple: the path between client and server includes network transmission, queuing, and service processing, and each stage can add latency.

TCP queue overflow

TCP queue overflow is a relatively low-level error that can show up as more superficial errors such as timeouts and RSTs. The symptoms are therefore more subtle, so let's discuss it separately.

As shown in the figure above, there are two queues: the syns queue (half-connection queue) and the accept queue (full-connection queue). In the three-way handshake, after receiving the client's SYN, the server places the connection in the syns queue and replies with SYN+ACK. When the server then receives the client's ACK, if the accept queue is not full, it moves the entry from the syns queue into the accept queue; otherwise it follows the behaviour specified by tcp_abort_on_overflow.

tcp_abort_on_overflow = 0 means that if the accept queue is full at step 3 of the three-way handshake, the server simply drops the ACK sent by the client. tcp_abort_on_overflow = 1 means that at step 3, if the full-connection queue is full, the server sends an RST packet to the client, abandoning both the handshake and the connection; this is why the logs may show a lot of connection reset / connection reset by peer errors.
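
To check or change this behaviour (a hedged example; whether you actually want RSTs depends on your situation):

cat /proc/sys/net/ipv4/tcp_abort_on_overflow   # 0 = silently drop the ACK, 1 = reply with RST
sysctl -w net.ipv4.tcp_abort_on_overflow=1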

So in real development, how can we quickly locate TCP queue overflow?

With the netstat command: run netstat -s | egrep "listen|LISTEN". As shown in the figure above, overflowed represents the number of full-connection queue overflows and sockets dropped represents the number of half-connection queue overflows.

With the ss command: run ss -lnt. For a port in the LISTEN state, Send-Q indicates the maximum size of the full-connection queue (5 in the third column of the example), and Recv-Q indicates how much of the full-connection queue is currently in use.

Let’s see how to set the full connection queue size and half connection queue size:

The size of the full-connection queue depends on min(backlog, somaxconn): backlog is passed in when the socket is created, and somaxconn is an OS-level kernel parameter. The size of the half-connection queue depends on max(64, /proc/sys/net/ipv4/tcp_max_syn_backlog).
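
A hedged example of inspecting and raising the relevant kernel limits; the values are only illustrative:

cat /proc/sys/net/core/somaxconn              # cap applied to any listen() backlog
cat /proc/sys/net/ipv4/tcp_max_syn_backlog    # half-connection (SYN) queue limit
sysctl -w net.core.somaxconn=1024
sysctl -w net.ipv4.tcp_max_syn_backlog=2048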

In daily development, we often use servlet containers as servers, so we sometimes need to pay attention to the size of the container’s connection queue. In Tomcat the backlog is called acceptCount; in Jetty it is acceptQueueSize.

RST exceptions

An RST packet represents a connection reset and is used to close connections that are no longer needed; it usually indicates an abnormal close, as opposed to the normal four-way close.

In real development, we often see connection reset / connection reset by peer errors caused by RST packets.

Port does not exist

If a SYN connection request is sent to a port that does not exist, the server finds that it is not listening on that port and directly returns an RST packet to break the connection.

Actively terminating the connection with RST instead of FIN

Generally speaking, a normal connection close is done with FIN packets, but an RST packet can be used in place of FIN to terminate the connection immediately. In practice this can be controlled by setting the SO_LINGER value; it is often done deliberately to skip TIME_WAIT and improve interaction efficiency, so use it with caution.

An exception occurs on either the client or the server side, and that side sends an RST to the peer to close the connection

The RST packets sent on TCP queue overflow that we discussed above actually fall into this category. It usually means that one party can no longer handle the connection properly for some reason (for example, the application crashed or the queue is full) and tells the other party to close the connection.

A received TCP packet does not belong to any known TCP connection

For example, one party loses a TCP packet because of a poor network, and the other party closes the connection. Much later, the missing TCP packet finally arrives, but the corresponding TCP connection no longer exists, so the receiving party simply sends an RST packet so that a new connection can be opened.

If one party does not receive an acknowledgement from the other for a long time, it sends an RST packet after a certain amount of time or number of retransmissions

This is mostly related to the network environment. Poor network environment may result in more RST packets.

As mentioned earlier, too many RST packets will cause the program to report errors. A read operation on a closed connection reports connection reset, while a write operation on a closed connection reports connection reset by peer. You may also see a broken pipe error, which is a pipe-level error meaning a read or write on a closed pipe, i.e. continuing to read or write datagrams after receiving a connection reset error, as described in the glibc source code comments.

How do we determine whether RST packets are present when troubleshooting? Naturally, by capturing packets with the tcpdump command and analyzing them with Wireshark: tcpdump -i en0 tcp -w xxx.cap, where en0 is the NIC to listen on.
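
If you only care about RST packets, a narrower capture filter can be used (en0 is machine-specific and only illustrative):

tcpdump -i en0 'tcp[tcpflags] & (tcp-rst) != 0' -w rst.cap   # capture only packets with the RST flag set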

Next, you can open the captured file with Wireshark and see something like the figure below, where the red entries represent RST packets.

TIME_WAIT and CLOSE_WAIT

We all know what TIME_WAIT and CLOSE_WAIT mean. Online, we can directly use the command netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' to view the numbers of connections in time_wait and close_wait.

Using the ss command is faster: ss -ant | awk '{++S[$1]} END {for(a in S) print a, S[a]}'.


TIME_WAIT

time_wait exists partly so that delayed or lost packets from the old connection are not picked up by a later connection reusing the same port, and partly so that the connection can be closed normally within the 2MSL window. Its existence actually reduces the number of RST packets.

Excessive time_wait tends to occur in scenarios where short connections are frequent. In this case, some kernel parameters can be tuned on the server side:

# Enable reuse, allowing TIME-WAIT sockets to be reused for new TCP connections
net.ipv4.tcp_tw_reuse = 1
# Enable fast recycling of TIME-WAIT sockets on TCP connections
net.ipv4.tcp_tw_recycle = 1

Also pay attention to tcp_max_tw_buckets: once the number of time_wait sockets exceeds this value they are simply dropped, and the kernel reports "time wait bucket table overflow".

CLOSE_WAIT

A close_wait problem means the application did not send the FIN packet after the ACK. close_wait occurs even more often than time_wait and has more serious consequences; it is usually because something is blocked and the connection is never closed properly, which gradually exhausts all the threads.

To locate this kind of problem, it is best to use jstack to analyze the thread stacks, as described in the sections above. Here is just one example.

A developer reported that CLOSE_WAIT kept increasing after the application went live, until the process died. After running jstack, I found a suspicious stack: most threads were stuck in the countDownLatch.await method. In the end, the underlying exception was nothing more than the simplest class-not-found error that commonly appears after an SDK upgrade.