Journal like an expert

The original connection

Logs are like crime scene evidence, and developers are like CSI. Logs play a critical role in developers investigating bugs and the cause of server outages, and just as a lack of evidence leads to unsolved cases, a lack of meaningful logs can make troubleshooting difficult or even impossible. I’ve seen people struggle with troubleshooting tools (like tcpdump and Strace) or deploying new code to the server when it’s near a crash to get more logs to analyze the problem.

As the saying goes “A beard well lathered is half doctors’ chances of beating cancer “, all professional developers must know how to keep A diary effectively This article not only addresses how to provide a good application logging practice, but also explains why, such as when to log and what to log.

How to debug

It’s hard to talk about logging without talking about what a program is and how to log for debug.

Program as if you were changing states.

A program is a series of state changes

A state is what a program stores in memory at a certain point in time. The program’s code defines how the program transitions from one state to another. Developers using imperative programming languages, such as Java, focus more on stored procedures than on state. But thinking of a program as a series of states is a very important idea, because states give insight into what the program should do rather than how it does it.

For example, if a robot wants to fill up my car’s gas tank, and the explanation is simply to change state from (Tank =empty, money=$50) to (Tank =full, money=$15), If we describe the process in detail: the robot needs to find the gas station, drive the car to the gas station, and pay. While this process is very important, state transitions are a more intuitive way to measure the correctness of the program.

Debug

Debug is in on the return process of state transform of thinking, in my mind playback program which input, after a series of state transformation and then produce the output, in order to judge whether it is where the mistakes, in development, developers use the Debug tool to support the program runs in their mind, but in a production environment, It’s hard to use debug tools, and it’s more about logging.

What should I write in a journal

With the debug definition, it’s pretty straightforward what logs should write.

The log should contain enough information to allow us to reconstruct state transitions

Capturing the state at all points in time is impractical and unimportant, and police only need rough sketches rather than holograms to catch criminals. In the same way with logs, developers need to log when critical state changes occur, and logs need to record key characteristics as well as the current state and the reason for the state change.

Critical state transformation

Not all state transitions are worth documenting. The key to identifying critical states is to treat the program as a series of state transitions, divide state transitions into phases, and then focus on phase transitions.

So for example,

Suppose an app starts with three phases

  • Load the configuration

  • Connection depends on

  • Start the service

It is necessary to log when an application starts and ends. When an application import dependency hangs, the log can be clearly displayed. The application loaded its configuration, entered the dependency connection state, but did not complete.

The key features

Logging is like profiling your program; only the core business logic should be captured. When you have sensitive data like PII(personal information), you need to log it, but you have to obfuscate it, or record it carefully.

For example, when an HTTP server transitions from a “pending request state” to a “accept request” state, it should record the HTTP request and THE URL. Other elements, such as the request header and the request body, are recorded only if they affect the business logic. For example, request headers should only be logged if the server state changes significantly, such as from Content-Type:application/json to Content-Type:multipart/form-data.

Cause of state change

Recording the reason for the state change is useful for debugging. Logs should simply contain before and after state changes and explain why, they can help developers connect various event points and lead them to reconstruct the execution of the application.

The sample

A simple example: Suppose the server receives an abnormal SSN number and the developer wants to log it.

Some anti-pattern logs lack critical states and why:

  • [2020-04-20T03:36:57 +00:00] server.go: Indicates an Error processing request

  • [2020-04-20 t03:36:57 + 00:00] server. Go: SSN rejected

  • Server. go: SSN Rejected for user UUID “123e4567-E89B-12D3-A456 — 426655440000”

This contains some information, but does not address some of the questions that developers may see in the logs, such as: What kind of request is causing the service problem? Why was the SSN rejected? Which users are affected? A good debug would look like this:

[2020-04-20T03:36:57 +00:00] Server. go: Received a SSN Request (Track ID: E4a49a27 — 1063 — 4AB3 — 9075-cf5faec22a16) from user uuid “123e4567-e89b-12D3-a456 — 426655440000”, Reality was much like this because the server is expecting SSN format AA-GG-SSSS but got **-**

Who should keep a journal

In many cases, “who” is logging can lead you to fall into the trap of writing from the wrong function, which can cause the log to contain a lot of duplicate or useless information.

Think of the program as a layer of abstraction

Many well-constructed modern programs are like pyramids of layers of abstraction, with higher-level classes/functions breaking complex tasks into smaller ones, and lower-level classes/functions acting as black boxes to perform these smaller tasks and provide interfaces for higher-level people to call. This approach makes programming very simple because each layer only needs to focus on its own logic and doesn’t have to worry about the overall context.



For example, a website should consist of a business logic layer, an HTTP layer, and a TCP/IP layer. When responding to a URL request, the business logic layer focuses on which web page to display and passing the web content to the HTTP layer, which packages the content into an HTTP response, which the TCP/IP layer then splits into TCP packets and sends out.

Do not log at the wrong level

As a result of the level of abstraction, different levels have different information about the task in progress. In the previous example, the HTTP layer has no idea how the TCP layer sent the packet, much less what the user intended when requesting the URL. When choosing to log, the developer needs to choose the right level. And a correct understanding of the overall state change and cause.

In our SSN validation rule example, the logic for verifying SSN jumps to an SSN validator class:

public class Validator {
    // other validation functions ...
 
  public static void validateSSN(String ssn) throws ValidationException {
    // do the validation
    String regex = "^ (? ! 000 | 666) [0 to 8] [0-9] {2} - (? ! 00) [0-9] {2} - (? ! 0000) [0-9] {4} $";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(ssn);
 
    if(! matcher.matches()) {// --> Log Location A <--
      logger.info("Bad SSN blah, blah, blah...");
      throw new ValidationException(String.format("expecting SSN format AAA-GG-SSSS but got %s", ssn.replaceAll("\\d"."*"))); }}}Copy the code

There is another function that validates and updates user information, which calls the SSN validator:

public class Validator {
    // other validation functions ...
 
  public static void validateUserUpdateRequest(UserUpdateRequest req) {
    // validate other attribute of req ...
 
    try {
      validateSSN(req.ssn);
    } catch (ValidationException e) {
      // --> Log Location B <--
      logger.info(String.format("Received a user update request(track id %s) from user uuid %s, rejecting it because %s", req.trackID, req.uid, e.getMessage()));
      // other error handling logic ...
    }
 
    // other logic ...}}Copy the code

There are two places, A and B, to log, but B has enough information to log, and at A, the program doesn’t know what request it’s responding to, or which user is requesting it, Log is not much help and increase the redundancy, validateUserUpdateRequest to throw an error to the caller (validateRequest) is more reasonable, because the caller contains more context to write logs.

However, this does not mean that logging at the bottom level is not important, especially if the bottom level cannot expose errors to higher levels. For example, the network layer includes a retry mechanism, which means that higher levels are not aware of network problems. In general, low-level applications can write more DEBUG logs than INFO logs to avoid redundancy, and developers can adjust higher log levels to get more information if necessary.

How much log should I write?

There is an obvious trade off between useful information and log volume. The more meaningful log writes, the easier it is to reproduce state transitions. There are two ways to help you control log volume:

Estimate the relationship between logs and your workload

To control log volume, it is important to estimate first. Most programs have two workloads:

  • Accept a work item (request) and respond

  • Put your work in one place (pooled) and process it.

Most of the logs are triggered by the workload. The more workload an application has, the more logs it writes. There are also logs that are irrelevant to the workload, but the number of logs becomes irrelevant when the application is loaded. Developers need to calculate the number of logs and the amount of work

Number of logs =X X Number of jobs + constant logs

X can be calculated by examining the code. Developers need to have a good idea of the X value of their program and be able to estimate log capacity and budget. Some common sense about X:

  • 0<X<1: this means that the log is sampling or useless, such as logging only error logs, or using other log sampling algorithms, which can reduce the size of the log but also limit troubleshooting.

  • X~1: This means that there is only one rough log for each job, but it is a good reason to keep a log with sufficient information (see How to Log)

  • X>>1: there must be a good reason for X to be greater than 1. Because when the workload peaks, for example when the server receives a storm of HTTP requests, the amplification of X leads to a huge workload of log instances, which tends to cause more problems.

Make use of log levels

What happens when X is still very large, even after optimization it’s still very large? If the value of X is still much greater than 1, some logs should be lowered to INFO level so that the program can run at DEBUG level temporarily to provide more information while troubleshooting.

conclusion

In order to log like an expert, the first step is to think of the program as a series of state changes at an abstraction level. The key problem of journaling can be solved when you keep the following key questions in mind:

  • When to log: When a critical state change occurs.

  • What to log: Key information about the current state and the reason for the state change.

  • Who logs: Logs at the right level to ensure that the critical context is owned by the current level of abstraction.

  • How many logs to log: Estimate the number of x-value logs =X \* number of jobs + constant level logs to keep the X value within the appropriate range.