Engineering practice: How to print program logs properly?

A long time ago, a friend of mine asked me, if you were asked to take over an old project for follow-up maintenance, where would you start to get yourself started faster? At that time, I did not answer this friend’s question positively. I said: Whether an old project is easy to learn, a very key point is whether the project’s log is good enough. Generally speaking, an old project is relatively stable, and there is a large probability that there will not be a relatively large change or change in the future. Therefore, the core of such a project is “stability maintenance”. However, no one can guarantee that a project running online will not fail. In the event of online problems or failures, how to quickly stop loss is the first priority, and the log plays a very important role in stopping loss process. The log type is clear enough to help developers and operations quickly locate the problem and determine what to do to stop the loss.

Today let’s talk about how to make a project log “good”. The following is the outline table of contents:

Why do I need to print program logs regularly?

How to print program logs in a regular way?

Please forgive me if there is anything wrong, and welcome criticism and correction.

Please respect the author’s labor achievements, please mark the link of the original text:

https://www.cnblogs.com/dolphin0520/p/10396894.html

Why do I need to print program logs regularly?

In the process of writing program code, we usually focus on function realization and often ignore the importance of logs. However, logs are extremely important after the system goes online, because only logs can be used to understand the running status of the current system. In case of online faults, The clarity of the log determines whether you can quickly find a stop loss plan. We can look at the following code:

public class HttpClient {
    private static final Logger LOG = LoggerFactory.getLogger(HttpClient.class);

    private static int CONNECT_TIMEOUT = 5000;   // unit ms
    private static int READ_TIMEOUT = 10000;     // unit ms
    
    public static String sendPost(String url, String param) {
        OutputStream out = null;
        BufferedReader in = null;
        String result = "";
        try {
            URL realUrl = new URL(url);
            URLConnection conn = realUrl.openConnection();
            conn.setDoInput(true);
            conn.setDoOutput(true);
            conn.setConnectTimeout(CONNECT_TIMEOUT);
            conn.setReadTimeout(READ_TIMEOUT);
            conn.setRequestProperty("charset"."UTF-8");
            out = new PrintWriter(conn.getOutputStream());
            out.print(parm);
            out.flush();
            in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while((line = in.readLine()) ! = null) { result += line; } } catch (Exception ex) { LOG.error("post request error!!!");
        } finally {
            try {
                if(out ! = null) { out.close(); }if (in! = null) { in.close(); } } catch (IOException ex) { LOG.error("close stream error!!!");
            }
            returnresult; }}}Copy the code

One day, a large number of HTTP requests fail online. Then I check the logs and find a large number of post Request errors!! Error, at this point if you see such a log you may not know exactly what the cause is, but still have to continue through some other means to locate the specific cause.

Suppose the printed error log looks like this:

post request error!!! , url:[http://www.123.test.com], param:[name=jack] java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579)Copy the code

Connection refused (www.123.test.com) is the domain name of the HTTP service. Stop loss can be quickly found by the appropriate personnel. Avoid wasting a lot of time during fault location.

The above example is only a very small one, and the online problems you might encounter in daily development are much more complex and intractable. In summary, the main functions of logging are as follows:

1) The log is the “magic mirror” of the system operation, through which the running status of the system can be reflected in real time;

As shown in the figure above, in system A, producer keeps producing data and putting it into the data queue, while sender keeps taking data from the data queue and sending it to the receiver of downstream system B. Therefore, for system A, the amount of data to be sent in the data queue is A very key indicator. It can reflect the running status of the current system from the side. If the number of elements in the data queue exceeds 90% of the capacity, it indicates that the system may not run properly at this time, and there will be the risk of queue congestion. If the number of elements in a data queue is less than 10% of its capacity, it indicates that the system is running normally and the risk of queue congestion is low.

If this indicator is not output to the log, the development and operation personnel cannot know the running status of the current system A (of course, there are other ways to obtain this indicator, such as exposing the HTTP interface is one way).

2) Good log is convenient for later operation and developers to quickly locate online problems, speed up the stop-loss speed and reduce the loss caused by system failures;

3) Another function of logs is that they can be seamlessly combined with the monitoring system. Logs can be collected by the monitoring system to obtain relevant performance indicators of system operation, which is conducive to analyzing performance bottlenecks of the system and avoiding risks in advance.

For example:

Suppose you have a shopping mall system. At the beginning, the database provides services through two servers (one master, one slave). At this point, most interfaces can respond to user requests in seconds. With the passage of time, the mall system users increase gradually, concurrent queries and writing have a certain amount of growth, the amount of data in the database also slowly increase, lead to some SQL queries are getting slower and slower, suddenly one day, the database of the slave machine is down, due to too much slow query completely down, and lead to mall service is not available.

If mall system were recorded in the log time consuming situation of each HTTP request, through the monitoring system configuration log collection, configure corresponding alarm at the same time, it can be found in advance due to business growth performance of the system bottlenecks, in advance, system optimization, such as machine capacity, optimize SQL statements, depots table, etc.), avoiding risk.

4) Facilitate the statistics of business-related index data, and carry out relevant business analysis and function optimization.

For example:

For example, if a search system wants to count the search usage ratio of different regions (such as the north and south regions) in the past week, if the IP address of each query request is printed in the log, it can be easily counted. Otherwise, it needs to go online again and add the log to make statistics.

Therefore, we should pay attention to the normalization of log writing in the daily code writing process, so that it can give full play to its due value, which can help ensure the stable operation of our service and effectively improve the efficiency of system maintenance in the later period.

How to print program logs in a regular way?

Next, let’s talk about how to print logs properly from the following aspects.

2.1 Log file naming

In general, the naming of log files can include the following key information:

Type identification (logTypeName) Log level (logLevel) Log generation time (logCreateTime) Log backup number (logBackupNum)Copy the code

Log type identifier: indicates the function or purpose of a log file. For example, a Web service logs that record HTTP requests are usually named request.log or access.log. The logs used to record the overall running of the service are usually named by the serviceName (serviceName, appKey) or hostName, such as nginx.log.

Log level: It is recommended that logs of all levels be classified by file when printing logs. If logs of all levels are recorded in the same log file, you need to search for logs in the file when locating problems. Log levels include DEBUG, INFO, WARN, ERROR, and FATAL. In actual code writing, strict matching mode or non-strict matching mode can be adopted. In strict matching mode, only INFO logs are printed in INFO log files, and only ERROR logs are printed in ERROR log files. In non-strict matching mode, INFO log files can record INFO, WARN, ERROR, or FATAL logs. WARN log files can record WARN, ERROR, or FATAL logs, and so on.

Log generation time: Add the creation time of a log file to the name of a log file for sorting when searching for a log file.

Log backup number: When cutting logs and rolling logs by file size, you can add the number to the end of the log file name.

2.2 Log Scrolling

Although logs can store key information about the running of the system, due to the limited disk space, we cannot keep logs indefinitely. Therefore, we must have a log rolling policy. Log scrolling usually has the following modes:

The first: scroll by time

Second: scroll by the size of a single log file

Third: scroll by time and the size of a single log file.

Scrolling by time means that a new log file is created at regular intervals. You can usually scroll by hour or by day, depending on the number of system logs to be printed. If the number of system logs is small, you can scroll by day. If the system has a large amount of daily traffic, you are advised to roll by hour.
When a log file reaches a certain size, a new log file is created. It is recommended that the size of a log file not exceed 500 MB. If a log file is too large, it may affect log monitoring and fault locating.
Scrolling by time and individual log file size is typically appropriate for scenarios where you want to keep logs for a certain amount of time, but you don’t want a single log file to be too large. Logback provides this configuration mode, for example, may refer to: logback. Qos. Ch/manual/appe…

There are two key parameters for the log rolling policy: the maximum number of retained logs and the maximum disk space. These two parameters must be set, if not set, it is likely to fill the online machine disk situation.

2.3 Log Level

The log levels are as follows:

Debug /trace, INFO, Warning, Error, fatal

The severity of these log levels increases in order:

Debug/Trace: Debug and trace logs are not suitable for online production environments because they contain a large amount of content. They are generally used for offline debugging. Even if the online environment is to be used, it needs to be controlled by a switch, which is only turned on to locate and track online problems;

Info: info Logs record key system running status, key service logic, and key execution nodes. It is important to remember, however, that INFO logs must not be abused. If they are abused, they are not very different from Debug /trace logs.

Warning: Warning logs record unexpected events that occur during system running. As the name suggests, warning logs remind development and O&M personnel that they need to pay attention to and handle them immediately without human intervention.

Error: Error logs are used to record common errors that occur during system running. Once these errors occur, users’ access or usage is affected and manual handling is required. However, in most production environments, manual intervention is not necessary when error logs occur. In most cases, a comprehensive judgment is made based on the number and duration of error logs.

Fatal: Indicates a fatal system error. A fatal error indicates that the system almost dies and requires immediate manual intervention.

The following simple example to illustrate, if we have such a scenario, we have a salary calculation system, every month need 1 from the employees’ attendance system for all company employees’ attendance data, then according to the attendance data to calculate last month should pay, you need to have a function to get employees’ attendance data from attendance system:

public Map<Long, Double> getEmployeeWorkDaysFromAttendance(int year, int month, Set<Long> employeeList) throws BusiessException {// Entry key logs. Key parameters need to be printed. Because there may be a large number of employeeList entries, the contents of the employeeList list are not directly printed. Just print size logger.info("get employee work days, year:{}, month:{}, employeeList.size:{}", year, month, employeeList.size()); // To check the list of employees temporarily, enable the debug log functionif (debugOpen()) {
            logger.debug("employ list content:{}", JSON.toJsonString(employeeList));
        }
        
        int retry = 1;
        while (retry <= MAX_RETRY_TIMES) {
            try {
                Map<Long, Double> employeeWorkDays = employeeAttendanceRPC.getEmployeeWorkDays(year, month, employeeList);
                logger.info("get employee work days success, year:{}, month:{}, employeeList.size:{}, employeeWorkDays.size:{}", year, month, employeeList.size(), employeeWorkDays.size());
                return employeeWorkDays;
            } catch (Exception ex) {
                logger.warning("RPC invoke failed (employeeAttendanceRPC getEmployeeWorkDays), retry times: {...}), year: {}, the month: {}, employeeList. Size: {}", retry, year, month, employeeList.size(), ex); // For languages that do not have an exception mechanism, error logs should be printed hereif (retry == MAX_RETRY_TIMES) {
                    throw new BusiessException(ex, "rpc invoke failed(employeeAttendanceRPC.getEmployeeWorkDays)"); } } retry++; }}Copy the code

2.4 Selection of log printing timing

Logs are used to learn about the current running status of the system and locate online problems. Therefore, the timing of printing logs is very important. If logs are abused, too many logs will be generated, affecting the efficiency of locating problems. If too few logs are printed, critical logs may be missing. As a result, fault root tones cannot be found when faults are located online. Therefore, it is important to know when to print logs. The following are common times for printing logs:

1) HTTP call or RPC interface call

When a program calls another service or system, it needs to print the interface call parameters and call results (success/failure).

2) Abnormal program

When an exception occurs, the program either chooses to throw it up or must print the exception stack in the catch block. Note, however, that it is best not to print an exception log repeatedly, such as an exception thrown up and an error log in a catch block (except for the entry to the external RPC interface function).

3) Special conditional branches

When the program enters special conditional branches, such as special ELSE or switch branches. For example, we calculate salaries based on years of service:

 public double calSalaryByWorkingAge(int age) {
        if (age < 0) {
            logger.error("wrong age value, age:{}", age);
            return 0;
        }
        // ..
    }Copy the code

In theory, you can’t work for less than zero, so you need to print out the unexpected, which is possible by throwing an exception.

4) Key execution paths and intermediate states

Record key log information about key execution paths and intermediate status. For example, an algorithm may be divided into many steps. Record the output result of each step to facilitate locating and tracking the algorithm execution status.

5) Request entry and exit

You need to print entry/exit logs at the entry/exit of functions or external interfaces to facilitate follow-up log statistics and system running status monitoring.

2.5 Log Content and Format

Log printing timing determines whether problems can be located based on logs, and log content determines whether problems can be quickly found based on logs. Therefore, log content is also critical. In general, a log line should contain at least the following components:

LogTag, Param, exceptionStacktrace

The logTag is the log identifier, which identifies the log output scenario or reason, the parAM is the function call parameter, and the exceptionStacktrace is the exception stack. For example:

good case

public class HttpClient {
        private static final Logger LOG = LoggerFactory.getLogger(HttpClient.class);

        private static int CONNECT_TIMEOUT = 5000;   // unit ms
        private static int READ_TIMEOUT = 10000;     // unit ms

        public static String sendPost(String url, String param) {
            OutputStream out = null;
            BufferedReader in = null;
            String result = "";
            try {
                URL realUrl = new URL(url);
                URLConnection conn = realUrl.openConnection();
                conn.setDoInput(true);
                conn.setDoOutput(true);
                conn.setConnectTimeout(CONNECT_TIMEOUT);
                conn.setReadTimeout(READ_TIMEOUT);
                conn.setRequestProperty("charset"."UTF-8");
                out = new PrintWriter(conn.getOutputStream());
                out.print(parm);
                out.flush();
                in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
                String line;
                while((line = in.readLine()) ! = null) { result += line; }} catch (Exception ex) {// Have a keylogTag, has parameter information, has error stack log.error ("post request error!!! , url:[[}], param:[{}]", url, param, ex);
            } finally {
                try {
                    if(out ! = null) { out.close(); }if (in! = null) { in.close(); } } catch (IOException ex) { LOG.error("close stream error!!! , url:[[}], param:[{}]", url, param, ex);
                }
                returnresult; }}}Copy the code

bad case

public class HttpClient {
    private static final Logger LOG = LoggerFactory.getLogger(HttpClient.class);

    private static int CONNECT_TIMEOUT = 5000;   // unit ms
    private static int READ_TIMEOUT = 10000;     // unit ms
    
    public static String sendPost(String url, String param) {
        OutputStream out = null;
        BufferedReader in = null;
        String result = "";
        try {
            URL realUrl = new URL(url);
            URLConnection conn = realUrl.openConnection();
            conn.setDoInput(true);
            conn.setDoOutput(true);
            conn.setConnectTimeout(CONNECT_TIMEOUT);
            conn.setReadTimeout(READ_TIMEOUT);
            conn.setRequestProperty("charset"."UTF-8");
            out = new PrintWriter(conn.getOutputStream());
            out.print(parm);
            out.flush();
            in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while((line = in.readLine()) ! = null) { result += line; }} catch (Exception ex) {// There is no error message log.error ("post request error!!!");
        } finally {
            try {
                if(out ! = null) { out.close(); }if (in! = null) { in.close(); } } catch (IOException ex) { LOG.error("close stream error!!!");
            }
            returnresult; }}}Copy the code

In addition, for external HTTP or RPC interfaces, it is desirable to have a requestId for each request so that all subsequent execution paths of each request can be tracked.

Reference article:

zhuanlan.zhihu.com/p/27363484

Blog.csdn.net/zollty/arti…

www.jianshu.com/p/59cd61eb9…

www.jianshu.com/p/6149463ae…

blog.jobbole.com/56574/

www.cnblogs.com/kofxxf/p/37…

Gitbook. Cn/books / 5 ae68…

www.cnblogs.com/xybaby/p/79…

Blog.didispace.com/cxy-wsm-zml…

www.kancloud.cn/digest/java…

Blog.csdn.net/bad_yu/arti…

Engineering practice: How to print program logs properly?