Contents

  • 1. Bugs
  • 2. Try to solve the case
  • 3. The truth emerges
  • 4. To summarize
  • Update record
  • Related articles recommended

1. Bugs

In November 2021, a business system at a customer site went down. The operations (O&M) team found that the system's process had disappeared, and the logs contained no error information.

O&M restarted the service and the system ran normally, but it failed again a day or several days later.

A notable pattern: when the service was restarted on a Friday, it ran normally over the weekend, when almost nobody used it. On Monday, once everyone was back at work and using the system, it went down again.

2. Try to solve the case

Since no error log was produced, our first thought was an operating system resource limit: memory, CPU, socket connections, open files, and so on. If that were the cause, the operating system would have killed the process and logged it.

1. View system-level logs

View the system log: cat /var/log/messages | grep -i 'kill'

View the kernel log: dmesg | grep -i 'kill'

No records related to Java, OOM, or kill were found.

2. View the JVM crash logs

When the JVM crashes, it generates an hs_err_pid%p.log file, where %p is the process ID. By default the file is written to the JVM's working directory. You can also specify the path when the JVM starts: -XX:ErrorFile=/var/log/hs_err_pid_%p.log

No JVM crash logs were found in the working directory.

3. View OOM Heap Dump logs

If the JVM was started with the parameters -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=*/java.hprof and the crash was caused by an OOM, a heap dump file will be in the corresponding path. Analyzing the dump with a tool such as JVisualVM can then locate the problem.

The system had these JVM parameters set, but no dump file was written to the corresponding path.

There were no logs at all. Could something have prevented the logs from being generated? Because the outage occurred on Monday, once customers started using the system, we suspected a memory problem and started monitoring memory:

4. Monitor OS process memory and JVM heap memory

Script to monitor OS process memory using the top command:

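#!/bin/bash
# PID must be set to the ID of the Java process before this script runs

while true
do
  datetime=$(date '+%Y-%m-%d %H:%M:%S')
  echo "$datetime" >> record_new3.txt 2>&1
  # One batch-mode snapshot from top, filtered to the monitored process
  top -d 1 -b -n1 | grep "$PID" >> record_new3.txt 2>&1
  sleep 60
done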

Monitor JVM heap memory using the jmap command:

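#!/bin/bash
# PID must be set to the ID of the Java process before this script runs

while true
do
  datetime=$(date '+%Y-%m-%d %H:%M:%S')
  echo "$datetime" >> record_jmap_new3.txt 2>&1
  # Heap configuration and usage summary of the monitored JVM
  /home/jdk8u282-b08/bin/jmap -heap "$PID" >> record_jmap_new3.txt 2>&1
  sleep 120
done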

Analysis of the monitoring logs shows that:

JVM heap memory (from the jmap monitoring script) grows from 3 GB to nearly 14 GB (the JVM's maximum heap size is set to 16 GB), then a GC runs and usage drops, and the cycle repeats. No heap overflow occurred.

The Java process memory (from the top monitoring script) keeps growing until it reaches about 16 GB (the system has 256 GB of memory, so there is still plenty to spare).

A GC reclaims part of the JVM heap, but the memory the operating system has allocated to the Java process is not returned. A GC only marks heap space as available for reuse, which is why the process memory seen by the OS keeps growing and finally settles at a large value. (This is the normal case, where off-heap memory is not the problem. In the other case, abnormal growth of off-heap memory makes the process so large that the OS kills it. Then there must be some other cause for the off-heap growth, and that cause is what you need to find.)

Monitoring of the crashing system showed no memory problem: once the heap grew to a certain point, the JVM ran a GC and heap usage dropped, and the process memory eventually settled at a stable value.
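The same heap versus process-memory comparison can also be sampled from inside the JVM. The following is only a minimal sketch, not what was deployed in this case: MemorySnapshot is a hypothetical helper name, and it uses nothing beyond the standard java.lang.management API. The non-heap figure covers only the pools the JVM tracks itself (such as metaspace and the code cache), so native allocations outside those pools would still have to be watched with OS-level tools like top.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; not part of the original system.
class MemorySnapshot {

    /** Returns one log line with heap and non-heap usage in MB. */
    static String describe() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        MemoryUsage nonHeap = bean.getNonHeapMemoryUsage();
        return String.format(
                "heapUsed=%dMB heapCommitted=%dMB nonHeapUsed=%dMB nonHeapCommitted=%dMB",
                toMb(heap.getUsed()), toMb(heap.getCommitted()),
                toMb(nonHeap.getUsed()), toMb(nonHeap.getCommitted()));
    }

    /** Converts bytes to whole megabytes (rounding down). */
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // A real monitor would call this periodically and append to a log file.
        System.out.println(describe());
    }
}
```

A drop in heapUsed after a GC while heapCommitted stays high matches the pattern described above: the heap is freed for reuse inside the JVM, but the process does not shrink from the operating system's point of view.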

At this point I had tried everything I could think of without finding the real problem, so I wrote a watchdog script that restarts the service within 5 minutes after the system goes down:

#!/bin/bash

while true
do
  log_date=$(date '+%Y-%m-%d')
  sync_date=$(date '+%Y-%m-%d %H:%M:%S')
  # PIDs of processes using TCP port 7001 (empty if the service is down)
  pid=$(/usr/sbin/lsof -t -i tcp:7001)
  logfile="/home/TAS/TAS2810/logs/restart-$log_date.log"
  if [ -n "$pid" ]; then
    echo "${sync_date}::PID: $pid, XX system service is running..." >> "$logfile"
  else
    echo "${sync_date}::XX system service does not exist, restarting..." >> "$logfile"
    cd /home/TAS/TAS2810/bin
    nohup bash /home/TAS/TAS2810/bin/StartTAS.sh &
    echo "${sync_date}::restart complete!" >> "$logfile"
  fi
  sleep 300
done

3. The truth emerges

To summarize the situation so far: the service dies every so often, there are no error entries in the business logs and no crash logs at the operating system level, and memory monitoring looks normal.

So what is the problem? The process is definitely dead. Who killed it? Was it homicide? The OS logs show that nothing killed it. Was it suicide, through a memory problem or a JVM-level crash? No logs were found for that either.

What else could we do? The project brought in an architect for the system. Now we were in safe hands!

Linux has the strace tool, which can trace a process's system calls and signals, so it can show how a process exits:

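#!/bin/sh

# Find the PID of the Java process listening on port 7001 and attach strace to it.
# strace writes its trace to stderr, so redirect both streams into trace.log.
pid=$(netstat -tnalp | grep 7001 | grep LISTEN | awk '{print $7}' | cut -d/ -f1)
nohup strace -T -tt -e trace=all -p "$pid" > trace.log 2>&1 &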

The next day, the system went down again. The strace output showed that the process had exited on its own. Indeed, it was neither homicide nor suicide.
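To make "exited on its own" versus "killed" concrete: strace marks the end of a traced process with a line such as "+++ exited with N +++" or "+++ killed by SIGNAL +++". The sketch below classifies a trace by those markers. StraceExitClassifier is a hypothetical helper, and the sample lines in main are made up for illustration; they are not taken from the actual trace.log of this incident.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the original investigation.
class StraceExitClassifier {

    /** Scans strace output and reports how the traced process ended. */
    static String classify(List<String> lines) {
        String verdict = "unknown: no exit record found";
        for (String line : lines) {
            if (line.contains("+++ killed by ")) {
                verdict = "killed by signal: " + line.trim();
            } else if (line.contains("+++ exited with ")) {
                verdict = "exited on its own: " + line.trim();
            }
        }
        return verdict;
    }

    public static void main(String[] args) throws IOException {
        // With a file argument, classify that trace; otherwise use made-up sample lines.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList(
                        "10:15:42.000001 exit_group(1) = ?",
                        "10:15:42.000002 +++ exited with 1 +++");
        System.out.println(classify(lines));
    }
}
```

A process killed from outside, for example by the OOM killer, would end with a "killed by" marker instead, which is the case the OS logs had already ruled out.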

The architect also suggested that it was probably a problem with calls to the Linux X server. We reproduced it in a local environment, and that was indeed the cause.

Specifically: the system has workflow-related functions that use the Java AWT library to bring up a graphical interface, and drawing that GUI calls the X display server of the environment in which the server was started. Once the shell window used to start the server is closed, the next time a client uses the workflow function the server cannot find the X display server environment, and the system exits on its own.

In short, the AWT library cannot find the X display server environment.

To learn more about X DISPLAY Server, please refer to article: How to open Linux GUI program -Window System correctly on SSH Terminal

How do we fix it? Add the JVM parameter -Djava.awt.headless=true. This parameter tells the JVM that the runtime environment has no display, mouse, keyboard, or similar hardware, and that AWT-related calls should be served with resources in the background (for example, creating images in the background does not need a screen). The following demo helps illustrate the parameter:

import javax.swing.*;
import java.awt.*;
import java.awt.event.*;
public class Calculator {
    static double num;
    public static void main(String[] args) {
        //System.setProperty("java.awt.headless", "true");
        System.setProperty("java.awt.headless", "false");
        System.out.println("Is it a headless environment? " + java.awt.GraphicsEnvironment.isHeadless());
        System.out.println("java.awt.headless value: " + System.getProperty("java.awt.headless"));
        // set up frame
        JFrame frame = new JFrame();
        frame.setSize(500, 500);
        frame.setTitle("Simple Calculator");
        frame.setLocationByPlatform(true);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

        // set up panel
        JPanel panel = new JPanel();
        // set layout to 5x2 grid layout
        panel.setLayout(new GridLayout(5, 2));

        // set up answer label
        JLabel answer = new JLabel();

        // set up number text fields
        JTextField num1 = new JTextField();
        JTextField num2 = new JTextField();

        // set up buttons
        JButton add = new JButton();
        add.setText("+");
        add.addActionListener(new ActionListener() {
            @Override
            public void actionPerformed(ActionEvent event) {
                try {
                    num = Double.parseDouble(num1.getText())
                    + Double.parseDouble(num2.getText());
                    answer.setText(Double.toString(num));
                } catch (Exception e) {
                    answer.setText("Error!");
                }
            }
        });

        JButton sub = new JButton();
        sub.setText("-");
        sub.addActionListener(new ActionListener() {
            @Override
            public void actionPerformed(ActionEvent event) {
                try {
                    num = Double.parseDouble(num1.getText())
                    - Double.parseDouble(num2.getText());
                    answer.setText(Double.toString(num));
                } catch (Exception e) {
                    answer.setText("Error!");
                }
            }
        });

        JButton mul = new JButton();
        mul.setText("*");
        mul.addActionListener(new ActionListener() {
            @Override
            public void actionPerformed(ActionEvent event) {
                try {
                    num = Double.parseDouble(num1.getText())
                    * Double.parseDouble(num2.getText());
                    answer.setText(Double.toString(num));
                } catch (Exception e) {
                    answer.setText("Error!");
                }
            }
        });

        JButton div = new JButton();
        div.setText("/");
        div.addActionListener(new ActionListener() {
            @Override
            public void actionPerformed(ActionEvent event) {
                try {
                    num = Double.parseDouble(num1.getText())
                    / Double.parseDouble(num2.getText());
                    answer.setText(Double.toString(num));
                } catch (Exception e) {
                    answer.setText("Error!");
                }
            }
        });

        // add components to panel
        panel.add(new JLabel("Number 1"));
        panel.add(new JLabel("Number 2"));
        panel.add(num1);
        panel.add(num2);
        panel.add(add);
        panel.add(sub);
        panel.add(mul);
        panel.add(div);
        panel.add(new JLabel("Answer"));
        panel.add(answer);

        // add panel to frame and make it visible
        frame.add(panel);
        frame.setVisible(true);
    }
}

The code above opens a simple GUI calculator. When it was run with Oracle JDK 1.8, java.awt.headless defaulted to false; with OpenJDK it defaulted to true. With java.awt.headless=true, the JVM is told that no display service is available, and creating the window fails with an error (a java.awt.HeadlessException).

Why the error? AWT needs to bring up a GUI program, but setting headless to true tells the JVM that the runtime environment has no display service and does not support GUI programs.
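The failure can be reproduced without the full calculator. The following is a minimal sketch, assuming java.awt.headless is set before any AWT class initializes; HeadlessProbe is a hypothetical class name used only for this illustration.

```java
import java.awt.GraphicsEnvironment;
import java.awt.HeadlessException;
import javax.swing.JFrame;

// Hypothetical probe for illustration; not part of the original system.
class HeadlessProbe {

    /** Tries to create a top-level window and reports what happened. */
    static String tryCreateFrame() {
        try {
            JFrame frame = new JFrame("probe");
            frame.dispose();
            return "frame created";
        } catch (HeadlessException e) {
            // Thrown by the window constructors when the JVM runs in headless mode
            return "HeadlessException";
        }
    }

    public static void main(String[] args) {
        // Must be set before AWT initializes, otherwise it has no effect
        System.setProperty("java.awt.headless", "true");
        System.out.println("isHeadless=" + GraphicsEnvironment.isHeadless());
        System.out.println(tryCreateFrame());
    }
}
```

In headless mode the exception is an ordinary Java exception that the application can catch and log, unlike the silent exit seen when a non-headless JVM loses its X display server.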

Next, set java.awt.headless=false and run the code again with an X server available. In the local environment, Microsoft VcXsrv X Server is installed with the display number set to 3600, and the shell that starts the JVM runs export DISPLAY=172.26.18.37:3600

java.awt.headless=true suits work that uses the AWT library but needs no display service, such as generating images, as the following demo shows:

import java.awt.Graphics;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class TestCHSGraphic {

    public static void main (String[] args) throws Exception {
        // Set the Headless mode
        //System.setProperty("java.awt.headless", "true");
        //System.setProperty("java.awt.headless", "false");
        System.out.println("Is it a headless environment? " + java.awt.GraphicsEnvironment.isHeadless());
        System.out.println("java.awt.headless value: " + System.getProperty("java.awt.headless"));

        // Draw into an in-memory image; no display is needed for this
        BufferedImage bi = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);

        Graphics g = bi.getGraphics();
        g.drawString(new String("Headless Test".getBytes(), "utf-8"), 50, 50);

        ImageIO.write(bi, "jpeg", new File("test.jpg"));
    }
}

If java.awt.headless is set to false and there is no X display server in the JVM's runtime environment, this code fails with the same kind of error as above.
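One way to catch this kind of misconfiguration early is to log the AWT mode when the service starts. The following is a minimal sketch under that assumption; HeadlessGuard and its messages are hypothetical, and only GraphicsEnvironment.isHeadless() and the DISPLAY environment variable come from the standard platform.

```java
import java.awt.GraphicsEnvironment;

// Hypothetical startup check for illustration; not part of the original system.
class HeadlessGuard {

    /** Describes the AWT mode; kept as a pure function so it is easy to test. */
    static String describe(boolean headless, String display) {
        if (headless) {
            return "headless: AWT will not contact an X display server";
        }
        return "NOT headless: AWT depends on DISPLAY=" + display;
    }

    public static void main(String[] args) {
        boolean headless = GraphicsEnvironment.isHeadless();
        String display = System.getenv("DISPLAY");
        System.out.println(describe(headless, display));
        if (!headless) {
            // A server-side JVM in this state can break when the display goes away
            System.err.println("Consider starting the JVM with -Djava.awt.headless=true");
        }
    }
}
```

Logging this once at startup would have pointed at the shell's DISPLAY setting long before the workflow function was first used.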

4. To summarize

  1. Do not assume that a vanished process must have been killed, whether by something else or by itself. Here the program simply exited on its own.
  2. Logs are the best place to start. If there is a log, start with it. The absence of logs is also informative: in this case it let us rule out memory problems and JVM crashes early.
  3. You cannot rely entirely on logs, because improper handling in code can sometimes swallow them. Trying to reproduce the problem can provide the breakthrough.
  4. When the cause of a problem cannot be found, it may lie in a blind spot in your knowledge. Learning more about the supporting technologies helps with troubleshooting.
  5. Troubleshooting means narrowing the scope step by step. Never take anything for granted; eliminate possibilities one at a time, as patiently as teaching a child. In many cases, taking something for granted or overlooking one small detail wastes a lot of time.

Update record

  • 2022-01-07 12:15: Reread, polished, and corrected for the Nuggets column before publication

Related articles recommended

  • SSH terminal how to open Linux GUI program -Window System correctly
