I. Background

Baidu's commercial products make up the advertising ecosystem that serves Baidu advertisers. They include promotion channels such as search promotion, information flow (feed) promotion, and brand promotion, as well as marketing tools such as Guanxingpan and Jimuyu.

Panorama of Baidu commercial products

Underneath these commercial products are complex Java business systems. The complexity shows up in three ways: many microservice subsystems, intricate call relationships between applications, and dependencies on many basic components. High complexity means that problems occur easily and are hard to locate. Yet problems in these products directly cause advertisers' operations to fail, whether they are placing ads or modifying bids and creatives.

“If you have worked on the front line of an advertising business system, you know very well that troubleshooting online problems is tedious and time-consuming.”

Locating a problem as soon as it appears, so that losses can be stopped and the fault repaired quickly, is a key technical pain point for the commercial product systems. To address it, the Baidu Commercial Platform department built a large-scale distributed microservice monitoring system. The previous article, “Baidu Commercial large-scale micro-service business monitoring system – Fengjing”, described how Fengjing uses a self-developed non-invasive probe and a high-performance call chain storage system to provide Baidu's business lines with microservice performance metrics, golden business metrics, health status, monitoring alarms, and more.

When an online alarm arrives, the on-call engineer first finds the module at the root of the problem, then the service interface of the faulty module, and finally the code stack where the problem occurs. Fengjing provides call chain data for checking the status code and latency of each call link, along with the error stack printed by the business system.

Table view of call chain provided by Fengjing (red arrow indicates long critical path)

In most cases, the call chain and the error stack printed by the system are enough to pinpoint the problem. However, some problems are tied to more specific scenarios, such as what was returned for a user request or how the business code accessed the cache. In those cases you need the business logs printed by the system to help locate the fault.

Fengjing does not collect or store business logs because of the data volume. It is deployed on thousands of microservice subsystems and runs in tens of thousands of containers. It collects hundreds of billions of call chain records every day, which already amounts to terabytes of storage per day. The business logs are estimated to total close to a petabyte per day, which is too expensive to store.

II. Technical principle

Traditional practice: to retrieve all business logs related to a single request, the logs are collected and stored in Elasticsearch (ES) for retrieval.

Log Collection Architecture

Kibana plus ES certainly provides richer and more flexible retrieval, but it is basically not feasible for a platform-level monitoring system like Fengjing. ES is expensive in resources, and the platform's daily log volume is close to a petabyte. Storing all of it in ES would mean high cluster resource consumption and maintenance costs. Besides, simply locating online problems does not require particularly complex log retrieval.

Can users view the entire call chain and the business logs associated with a single request at a small resource cost?

Fengjing's whole iteration history is about solving practical problems creatively with limited resources. The same holds for any really good system architecture: adapt to local conditions. In the book Transformation of Enterprise IT Architecture: Alibaba’s Strategic Thought and Architecture in Practice, Zhong Hua of Alibaba notes at the outset that “good innovation must be based on the current situation of the enterprise and adapt measures to local conditions”.

At present, the commercial platform's Java systems are uniformly deployed on Jarvis, the enterprise-level microservice hosting platform. Meanwhile, the Fengjing probe can trace and collect the system's behavior non-invasively, which is our technical advantage. We can use this advantage to work around the weakness of limited storage resources: since the probe can record every action the system takes while handling a request, it can also record the system's log printing actions.

Schematic diagram of holographic log technology

The probe records the business log file name and log offset associated with each request and stores them in the database. When a user retrieves the business logs related to a call chain in the Jarvis management console, the system first uses the call chain ID to look up metadata such as the container address, log file name, and log offset, then uses that metadata to fetch the complete log content from the specific container, and finally displays it to the user.
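The lookup half of this flow can be sketched roughly as follows. This is a minimal illustration, not Fengjing's implementation: the class `LogMetaStore`, the record `LogMeta`, and all field names are hypothetical, and an in-memory map stands in for the metadata database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for the metadata database, keyed by call chain ID (traceId).
// All names here are hypothetical; they are not Fengjing's actual schema or API.
final class LogMetaStore {

    // One metadata row: where a request's log lines live, not the lines themselves.
    static final class LogMeta {
        final String traceId;
        final String containerAddress;
        final String logFile;
        final long startOffset; // file offset before the log was printed
        final long endOffset;   // file offset after the log was printed

        LogMeta(String traceId, String containerAddress, String logFile,
                long startOffset, long endOffset) {
            this.traceId = traceId;
            this.containerAddress = containerAddress;
            this.logFile = logFile;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    private final Map<String, List<LogMeta>> byTraceId = new HashMap<>();

    // Called on the collection side: store only the small metadata row.
    void save(LogMeta meta) {
        byTraceId.computeIfAbsent(meta.traceId, k -> new ArrayList<>()).add(meta);
    }

    // Step 1 of retrieval: call chain ID -> metadata rows.
    List<LogMeta> findByTraceId(String traceId) {
        return byTraceId.getOrDefault(traceId, List.of());
    }

    public static void main(String[] args) {
        LogMetaStore store = new LogMetaStore();
        store.save(new LogMeta("trace-1", "10.0.0.1", "/logs/app.log", 0L, 120L));
        store.save(new LogMeta("trace-1", "10.0.0.2", "/logs/app.log", 300L, 420L));
        // Step 2 would fetch bytes [startOffset, endOffset) from each container.
        for (LogMeta m : store.findByTraceId("trace-1")) {
            System.out.println(m.containerAddress + " " + m.logFile
                    + " [" + m.startOffset + ", " + m.endOffset + ")");
        }
    }
}
```

The point of the design is that each row stays a few dozen bytes regardless of how large the log content is, so the database holds pointers while the containers keep the bulk data.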

Screenshot of the holographic log feature in the product

In this way, a small amount of storage and computing resources is enough to retrieve the large volume of business logs related to a call chain. Retrieval is limited by how long the logs are actually kept in the container, but analyzing and locating online problems rarely requires logs from far back. In most cases, the recent logs suffice.

“Holographic log technology was developed in-house by Fengjing, and related patents have been applied for.”

III. Algorithm implementation

The holographic log design has two main parts:

  1. Log metadata collection: intercept the operations before and after each log print to collect metadata.

  2. Metadata parsing: parse the metadata to find where the log file currently is and where the log sits within that file.

In practice, one log message may be printed to multiple log files, and log files may roll by time or by size according to the configured rolling policy. Log retrieval at different times must therefore work out automatically where the log actually is now, so that users never need to be aware of changes in the underlying log file location.
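To make the rolling problem concrete, here is a small sketch for one simple case only. It assumes a daily time-based roll in which the active file keeps its name and an archived file gets a `.yyyy-MM-dd` suffix; both the naming convention and the class `DailyRollResolver` are assumptions for illustration, and the article's design also covers size-based rolling and archive numbers, which this sketch does not.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical resolver for a daily rolling policy.
// Assumed convention: "app.log" is the active file, "app.log.2021-08-01" an archive.
final class DailyRollResolver {
    private static final DateTimeFormatter SUFFIX = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Returns the name of the file that holds a log line printed at `printedAt`,
    // as seen by someone searching on the date `today`.
    static String resolve(String activeFile, LocalDateTime printedAt, LocalDate today) {
        LocalDate printedDay = printedAt.toLocalDate();
        if (!printedDay.isBefore(today)) {
            return activeFile; // not rolled yet: still in the active file
        }
        return activeFile + "." + printedDay.format(SUFFIX); // rolled into that day's archive
    }

    public static void main(String[] args) {
        LocalDateTime printedAt = LocalDateTime.of(2021, 8, 1, 23, 59);
        System.out.println(resolve("app.log", printedAt, LocalDate.of(2021, 8, 1)));
        System.out.println(resolve("app.log", printedAt, LocalDate.of(2021, 8, 2)));
    }
}
```

The same metadata row resolves to a different file depending on when the search happens, which is why the file location is computed at retrieval time rather than stored once.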

Key problems solved in design and implementation:

  1. Performance and accuracy of metadata collection. To collect metadata accurately, log printing is intercepted by the Fengjing probe.

Metadata items and acquisition schematics

  • Bytecode is inserted before the original log printing operation to record the start time, the file offset before printing (obtained from the file descriptor), the file rolling policy (including the maximum file size and the rolling time), and the log level.

  • Bytecode is inserted after the original log printing operation to record the end time, the file offset after printing (obtained from the file descriptor), the file the log content was written to, the archive policy applied when the file rolls, and the archive sequence number used when archived files share the same name.

  • When collecting file offsets, reading the file descriptor directly instead of the file contents improves performance. In addition, the unique traceId of each invocation is injected into the printed log content for more accurate labeling, and the other collected data is used for parsing during log retrieval.
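The before and after hooks can be illustrated with a plain wrapper. This is only a sketch under stated assumptions: the real probe inserts bytecode around the logging framework's print call, whereas the hypothetical `OffsetRecordingWriter` below writes the line itself, and the `[traceId]` line prefix is an invented format. What it does show is the offset capture: `FileChannel.size()` asks the open file for its length without reading any content.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for the probe's before/after hooks around a log print.
final class OffsetRecordingWriter implements AutoCloseable {
    private final FileChannel channel;

    OffsetRecordingWriter(Path logFile) throws IOException {
        this.channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Appends one line tagged with the traceId and returns {offsetBefore, offsetAfter}.
    long[] write(String traceId, String message) throws IOException {
        long before = channel.size(); // "before" hook: size lookup, no content read
        byte[] line = ("[" + traceId + "] " + message + "\n").getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.wrap(line);
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        long after = channel.size(); // "after" hook: offset once the line is written
        return new long[] {before, after};
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path logFile = Files.createTempFile("business", ".log");
        try (OffsetRecordingWriter writer = new OffsetRecordingWriter(logFile)) {
            long[] offsets = writer.write("trace-1", "cache miss for user request");
            System.out.println("offsets: " + offsets[0] + " -> " + offsets[1]);
        } finally {
            Files.deleteIfExists(logFile);
        }
    }
}
```

The sketch assumes a single writer. With concurrent writers, other threads can append between the two size lookups, so the recorded range may contain unrelated lines; that is one reason a traceId embedded in each line is useful for filtering afterwards.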

  2. Metadata parsing. When a user starts a log search, an algorithm resolves the current log location.

Flow chart of retrieval algorithm

  • Query all log metadata records that match the traceId of the call chain.

  • Obtain the end time of log printing, the name of the file being written when the log was printed, and the archive policy.

  • Parse the archive policy.

  • Inject baseline parameters that depend on the archive policy and simulate a log printer to work out the file location at this point in time.

  • Using the file location and the offsets recorded before and after printing, read the log content between the two offsets.

  • Use the traceId to calibrate the content.
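The last two steps, reading the byte range and calibrating with the traceId, might look like the sketch below. It assumes the file location has already been resolved and that each log line contains its traceId as plain text; the class `LogSliceReader` and its signature are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader: fetch bytes [start, end) from a log file, keep lines with the traceId.
final class LogSliceReader {

    static List<String> read(Path file, long start, long end, String traceId) throws IOException {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid offset range");
        }
        List<String> matches = new ArrayList<>();
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long safeEnd = Math.min(end, channel.size()); // the file may be shorter than recorded
            if (safeEnd <= start) {
                return matches;
            }
            ByteBuffer buffer = ByteBuffer.allocate(Math.toIntExact(safeEnd - start));
            long position = start;
            while (buffer.hasRemaining()) {
                int n = channel.read(buffer, position); // positional read, only the slice
                if (n < 0) {
                    break;
                }
                position += n;
            }
            String text = new String(buffer.array(), 0, buffer.position(), StandardCharsets.UTF_8);
            for (String line : text.split("\n")) {
                if (line.contains(traceId)) { // calibration: drop lines from other requests
                    matches.add(line);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("slice", ".log");
        try {
            Files.write(file, "[t1] first\n[t2] other\n[t1] second\n".getBytes(StandardCharsets.UTF_8));
            System.out.println(read(file, 0, 22, "t1"));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

A substring match is the simplest possible calibration; a production version would more likely match a delimited traceId field so that one ID cannot match as a prefix of another.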


IV. Conclusion

With the holographic log technology developed by Fengjing, business teams can quickly retrieve both the complete call chain and the complete business logs for a business request. For Fengjing as a distributed tracing system, this fills in the last missing piece of tracing. That said, the complexity of the business systems continues to shape the challenges Fengjing faces as a platform-level business monitoring product.

About the author:

Li Qiyuan, Senior R&D Engineer, Baidu Commercial Platform R&D Department

Responsible for the commercial platform's API gateway and for Fengjing, the microservice monitoring system of the Commercial Platform department. Has extensive practical experience in and a deep understanding of building high-performance, highly available distributed systems.
