Original: Taste of Little Sister (WeChat official account ID: XjjDog). You are welcome to share this article; please keep the source attribution.

A forensic examiner does not fear a badly decomposed body, nor a complicated case. What they fear most is a scene that leaves nothing behind: nothing to go on, no clues, nowhere to start.

The production environment is complex. A few minutes ago the process was alive and kicking; now it is lying there, dying. As the first witness on the scene, make sure you preserve it. Sometimes the worst place to be is caught in the crossfire with nothing to show, and we do not want that.

There is plenty of work to do before the process breathes its last. In this article we walk through common ways of preserving these clues, and finally automate the whole procedure with a shell script.
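The snippets below assume a few shell variables: $DUMP_DIR for the output directory, $PID for the target process, and later $JDK_BIN for the JDK tools. A minimal, hypothetical setup might look like this; the names and paths are illustrative, adjust them to your environment:

#!/bin/bash
# Hypothetical setup for the snippets that follow; paths are illustrative.
PID=$1                                     # target Java process id, passed as the first argument
DUMP_DIR=/tmp/dump_$(date +%Y%m%d%H%M%S)   # one directory per incident
JDK_BIN=${JAVA_HOME}/bin/                  # trailing slash matters: the commands use ${JDK_BIN}jinfo etc.
mkdir -p $DUMP_DIR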

The system environment, the crime scene that doesn’t lie

1. Current network connections of the system

ss -antp > $DUMP_DIR/ss.dump 2>&1

This command outputs all network connections to the ss.dump file. The ss command is used instead of netstat because netstat is very slow to execute when there are many network connections.

By looking at the state of each connection, you can quickly spot problems such as an excessive number of connections stuck in TIME_WAIT or CLOSE_WAIT.
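For a quick read on how the states are distributed, you can tally the first column of the dump; a small sketch, assuming the ss -antp output format where the state is the first column:

# Count connections per state, skipping the header line of the ss output
awk 'NR > 1 {print $1}' $DUMP_DIR/ss.dump | sort | uniq -c | sort -rn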

2. Network state statistics

netstat -s > $DUMP_DIR/netstat-s.dump 2>&1

This outputs the network statistics to the netstat-s.dump file. The statistics are broken down by protocol, which is very helpful for understanding the overall state of the network at that moment.
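One thing worth looking for in this dump is TCP retransmissions, which often point to trouble on the network path; a rough sketch (the exact wording of the counters varies between net-tools versions, hence the loose match):

# Pull the retransmission counters out of the dump; match loosely because wording differs across versions
grep -i retrans $DUMP_DIR/netstat-s.dump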

sar -n DEV 1 2 > $DUMP_DIR/sar-traffic.dump 2>&1

This command uses sar to capture the current network traffic. For very high-throughput components such as Redis and Kafka, a saturated network card is a common failure mode.
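To judge whether a card really is saturated, the rxkB/s and txkB/s columns from sar need a baseline: the negotiated link speed. A sketch, assuming the interface is eth0 and ethtool is available:

# Record the link speed so the sar traffic numbers have something to be compared against
ethtool eth0 | grep -i speed > $DUMP_DIR/ethtool-eth0.dump 2>&1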

3. Process resources

lsof -p $PID > $DUMP_DIR/lsof-$PID.dump

This is a very powerful command. It shows every file a process has open, giving you a view of overall resource usage at the process level. Note that it can be slow to run when the process holds a large number of resources.
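A quick cross-check of the descriptor count against the process's limit often tells you immediately whether file handles are leaking; a sketch using /proc:

# Compare the number of open descriptors with the process's file-descriptor limit
ls /proc/$PID/fd | wc -l
grep 'Max open files' /proc/$PID/limits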

4. CPU resources

mpstat > $DUMP_DIR/mpstat.dump 2>&1
vmstat 1 3 > $DUMP_DIR/vmstat.dump 2>&1
sar -P ALL > $DUMP_DIR/sar-cpu.dump 2>&1
uptime > $DUMP_DIR/uptime.dump 2>&1

These commands were covered in detail in the Linux “Cast Away” CPU article. They record the current CPU usage and system load for troubleshooting.

Their output overlaps considerably, so pick out the parts you actually need when you analyze them.

5. I/O resources

iostat -x > $DUMP_DIR/iostat.dump 2>&1

For compute-oriented service nodes, I/O resources are usually fine. But problems do occur, such as excessive log output or a failing disk. This command records the basic performance figures of each disk for troubleshooting I/O problems.
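Note that iostat with no interval reports averages since boot, which can mask a spike that is happening right now; if the machine is still limping along, an interval sample is more telling (a sketch):

# A short interval sample reflects the current load better than the since-boot averages
iostat -x 1 3 > $DUMP_DIR/iostat-interval.dump 2>&1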

6. Memory problems

free -h > $DUMP_DIR/free.dump 2>&1

If you are interested in off-heap memory troubleshooting, xjjdog has an article on that as well. The most common problem is the JVM running out of memory, which we cover in the process-snapshot section below.

The free command gives an overview of the operating system's memory, which is one of the most important data points in troubleshooting.
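If you want finer-grained fields than free shows, it can also be worth saving /proc/meminfo alongside it, for example:

# /proc/meminfo carries more detail (slab, dirty pages, huge pages, ...) than the free summary
cat /proc/meminfo > $DUMP_DIR/meminfo.dump 2>&1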

7. Other global information

ps -ef > $DUMP_DIR/ps.dump 2>&1
dmesg > $DUMP_DIR/dmesg.dump 2>&1
sysctl -a > $DUMP_DIR/sysctl.dump 2>&1

In other xjjdog articles we have mentioned dmesg more than once. For many services that die quietly, dmesg holds the last trace of what happened to them.

And ps, probably the most frequently run command of all, produces output whose snapshot at that moment has reference value of its own.

Kernel configuration parameters can have a large effect on the system's behavior, so we export a copy of them as well.

Process snapshot, last words

1. jinfo

${JDK_BIN}jinfo $PID > $DUMP_DIR/jinfo.dump 2>&1

This command outputs basic information about the Java process, including its JVM flags and system properties.

2. GC information

${JDK_BIN}jstat -gcutil $PID > $DUMP_DIR/jstat-gcutil.dump 2>&1
${JDK_BIN}jstat -gccapacity $PID > $DUMP_DIR/jstat-gccapacity.dump 2>&1

jstat outputs the current GC statistics. That is usually enough to get a rough idea of what is going on; if not, jmap is used for deeper analysis.
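If the process is still alive enough to be sampled, several readings over time show the GC trend far better than a single one; jstat accepts an interval and a count, for example:

# Ten samples one second apart: a rough feel for whether GC pressure is climbing
${JDK_BIN}jstat -gcutil $PID 1000 10 > $DUMP_DIR/jstat-gcutil-trend.dump 2>&1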

3. Heap information

${JDK_BIN}jmap $PID > $DUMP_DIR/jmap.dump 2>&1
${JDK_BIN}jmap -heap $PID > $DUMP_DIR/jmap-heap.dump 2>&1
${JDK_BIN}jmap -histo $PID > $DUMP_DIR/jmap-histo.dump 2>&1
${JDK_BIN}jmap -dump:format=b,file=$DUMP_DIR/heap.bin $PID > /dev/null  2>&1

jmap collects dump information from the current Java process. Of the commands above, the fourth is actually the most useful, but the first three give you a quick overview of the system first.

That is because the file generated by the fourth command is usually very large, and it has to be downloaded and loaded into a tool such as MAT for in-depth analysis before it yields any answers.
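Since that heap dump can easily run into gigabytes, compressing it before pulling it off the machine is usually worth the extra seconds; a sketch:

# Heap dumps compress very well; gzip before transferring to the analysis machine
gzip $DUMP_DIR/heap.bin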

4. Execution stacks

${JDK_BIN}jstack $PID > $DUMP_DIR/jstack.dump 2>&1

jstack captures the current thread stacks. Normally you would take it several times and compare, but here we take just one snapshot. This information is very useful for reconstructing what the threads in your Java process were doing.

top -Hp $PID -b -n 1 -c >  $DUMP_DIR/top-$PID.dump 2>&1

For more detail, we use the top command to capture per-thread CPU usage within the process. That way you can see exactly where the resources are being spent.
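The thread IDs that top prints are decimal, while jstack reports them as hexadecimal nid values, so a quick conversion ties the two dumps together. A sketch, with 12345 standing in for a busy thread ID taken from the top output:

# 12345 is a placeholder for a busy thread id (PID column) from the top dump above
printf "0x%x\n" 12345                                    # -> 0x3039, the nid format jstack uses
grep "nid=$(printf '0x%x' 12345)" $DUMP_DIR/jstack.dump  # locate that thread's stack in the jstack dump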

5. Advanced fallbacks

kill -3 $PID

Sometimes jstack does not work. There are many possible reasons, for example the Java process being almost unresponsive. In that case we try sending a kill -3 signal to the process. The JVM handles this signal itself and prints a thread dump to its standard output (often redirected to a log file), so it serves as an alternative to jstack.
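Because that thread dump goes to the JVM's standard output rather than to a new file, it helps to know where stdout is pointing; a quick way to check:

# See where the process's stdout is redirected: that is where the kill -3 thread dump will land
ls -l /proc/$PID/fd/1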

gcore -o $DUMP_DIR/core $PID

When jmap itself cannot be executed, gcore (part of the gdb package) is an alternative. It produces a core file, from which we can generate a heap dump with the following command:

${JDK_BIN}jhsdb jmap --exe ${JDK_BIN}java --core $DUMP_DIR/core --binaryheap

Transient and historical states

Let me coin two terms here. The transient state is a snapshot of things as they are at that instant; the historical state is a curve of changes in a fixed set of monitored metrics, sampled at a regular frequency.

For much of this information, such as CPU usage and system memory, a transient reading is less intuitive than the historical view, because a single sample has no baseline to compare against. For those metrics it is far better to have something like a monitoring system.

However, data such as lsof output or a heap dump has no natural time-series structure; it cannot be fed into a monitoring system and produce value there, so it can only be analyzed as a transient snapshot. In those cases the snapshot is all the more valuable.
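As an aside, if no monitoring system is in place, a crude historical state for the numeric metrics can be improvised with a sampling loop; a throwaway sketch (interval and path are arbitrary):

# A poor man's monitoring loop: one timestamped vmstat line every ten seconds
while true; do
  echo "== $(date) ==" >> $DUMP_DIR/vmstat-history.log
  vmstat 1 2 | tail -1 >> $DUMP_DIR/vmstat-history.log   # the second sample covers the last second, not since boot
  sleep 10
done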

End

I have written the procedure above as a shell script. You can find it on GitHub:

https://github.com/sayhiai/shell

It is worth noting, however, that in a distributed environment the cause of a failure is often somewhere unexpected, and the evidence you collect on a single machine may be no more than a symptom. The evidence does not lie, but what it appears to point to can still mislead you about the true nature of the problem.


Think so? Give it a thumbs up.

Related articles:
Linux “Cast Away” (1): Preparation
Linux “Cast Away” (2): CPU
Linux “Cast Away” (3): Memory
Linux “Cast Away” (4): I/O
Linux “Cast Away” (5): Network
Java Off-Heap Memory Troubleshooting Summary