Introduction: The life cycle of log data spans collection, access, transmission, and application. The stability of this data underpins corporate reporting, decision analysis, and transformation strategy. This article introduces the current state of Baidu's log center and its application and adoption inside the company, with an in-depth discussion of data accuracy. It covers stability work across every link from data generation to final business use, including optimizing the timeliness of data reporting, persistence at the access layer, and building loss-free guarantees into streaming computation.

The full text is 4,047 words; the expected reading time is 12 minutes.

1. Overview

1.1 Positioning of the Log Center

The log center is a one-stop service for log data that manages the entire log data life cycle: with only light development work, a team can quickly stand up log collection, transmission, management, query, and analysis. It fits product operation analysis, R&D performance analysis, operations and maintenance management, and other business scenarios, helping clients on the APP side and server side explore data, mine value, and foresee the future.

1.2 Access Status

The log middle platform now covers most of the company's key products, including Baidu APP, mini programs, and matrix APPs. Its access footprint:

  • Access: covers almost all existing in-house APPs, mini programs, newly incubated APPs, and externally acquired APPs

  • Service scale: hundreds of billions of logs per day, peak QPS in the millions, service stability of 99.9995%


1.3 Terminology

Client: a software system used directly by users, usually deployed on users' mobile phones or PCs, such as Baidu APP or a mini program.

Server: a service that responds to network requests initiated by clients, usually deployed on cloud servers.

Log middleware: here, the client-side log middleware, which builds capabilities across the client log life cycle and includes the SDK, log server, log management platform, and other core components.

SDK: collects, encapsulates, and reports logs. By where logs are produced, SDKs split into APP-side and H5-side; by scenario, into the general tracking SDK, the performance SDK, and the mini program SDK. Users integrate whichever SDKs their requirements call for.

Log server: the log receiving service, and the core module on the server side of the log system.

Feature/model service: the log center forwards the tracking information needed for policy model computation to the downstream policy recommendation center in real time; the feature/model service is the entry module of that policy recommendation center.


1.4 Service Panorama

The log service mainly comprises the base layer, the management platform, business data applications, and product support. Around these layers, Baidu's client log reporting standard was formulated and released in June 2012.

  • Base layer: provides the APP SDK, JS SDK, performance SDK, and general SDK to satisfy all kinds of fast-access scenarios, and relies on big data base services to distribute tracking data to each consuming party.

  • Platform layer: the management platform supports the management and maintenance of tracking metadata and controls the full life cycle of each tracking point. Online, data is forwarded both in real time and offline, with service stability of 99.995% backed by sound traffic control and monitoring.

  • Business capability: log data feeds the data center, performance platform, strategy center, growth center, and more, effectively supporting product decision analysis, client quality monitoring, strategy-driven growth, and other areas.

  • Business support: covers key APPs, newly incubated matrix APPs, and horizontal general-purpose components.

2. Core Objectives of the Log Center

As mentioned above, the log center carries all APP log tracking inside Baidu and stands at the very front of data production. On top of full functional coverage and fast, flexible access, the most important core challenge we face is data accuracy. From data output, through log center access and processing, to downstream application, every data quality problem falls on the log center. Data accuracy breaks down into two guarantees:

  • No duplication: ensure, in the strict sense, that data is never duplicated. Duplication caused by retries of all kinds and by architecture-level failure recovery must be prevented at the system level.

  • No loss: ensure, in the strict sense, that data is never lost. Loss caused by system-level faults and code-level bugs must be prevented.

However, achieving nearly 100% loss-free delivery at the system level means facing many more problems.


2.1 Log Architecture

Log data accessed through the middle platform flows through the following stages: production on the device, the online service, and finally (real-time/offline) forwarding downstream.

  • Downstream consumption falls into these types (modeled in the sketch after this list):

    • Real-time:

      • Quasi-real-time stream (message queue): used for downstream data analysis. Characteristics: minute-level timeliness, strict data accuracy required. Typical applications: R&D platform, Trace platform;

      • Pure real-time stream (RPC proxy): used by downstream policies. Characteristics: second-level timeliness, a certain degree of data loss is tolerated. Typical application: recommendation architecture.

    • Offline: the offline big table, the complete set of all logs. Characteristics: day/hour-level timeliness, strict data accuracy.

    • Others: need a certain degree of timeliness and accuracy.
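To make the trade-offs concrete, here is a minimal Java sketch that models these channel classes. The enum, latency targets, and accuracy flags are illustrative assumptions, not the log center's actual configuration.

```java
/**
 * Illustrative model of the downstream forwarding channels described
 * above. Names and latency targets are assumptions for this sketch.
 */
public enum ForwardingChannel {
    // Quasi-real-time stream via message queue: minute-level
    // latency, strict no-loss/no-duplication required.
    QUASI_REALTIME_MQ(60_000, true),
    // Pure real-time stream via RPC proxy: second-level latency,
    // a small amount of loss is tolerated.
    PURE_REALTIME_RPC(1_000, false),
    // Offline big table: the complete log set, day/hour-level
    // latency, strict accuracy required.
    OFFLINE_TABLE(3_600_000, true);

    private final long maxLatencyMillis;  // target end-to-end delay
    private final boolean strictAccuracy; // no-loss, no-duplicate?

    ForwardingChannel(long maxLatencyMillis, boolean strictAccuracy) {
        this.maxLatencyMillis = maxLatencyMillis;
        this.strictAccuracy = strictAccuracy;
    }

    public long maxLatencyMillis() { return maxLatencyMillis; }
    public boolean strictAccuracy() { return strictAccuracy; }
}
```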


2.2 Problems

The log architecture above exhibits the following problems:

  • Giant module: the log server carries all of the data processing logic, with serious functional coupling:

    • Many functions: access & persistence, business logic processing, and forwarding of every kind (RPC, message queue, PB files written to disk);

    • Many fan-outs: the server forwards more than 10 fan-out streams.

  • Direct connection to message queues: from the business's perspective, messages sent to the queue may be lost, so businesses that tolerate neither duplication nor loss cannot be served.

  • No service hierarchy:

    • Core and non-core businesses are deployed in a coupled architecture;

    • They iterate together and affect each other.

3. No Duplication, No Loss

3.1 Theoretical Basis for Loss-Free Data

3.1.1 The "Only Two Loss Points" Theory

  • Client: because of the mobile environment (white screens, crashes, the process being killed before logs are flushed, unpredictable next launch time), client messages can be lost with some probability

  • Access layer: because server failures are unavoidable (service restarts, machine faults), there is some probability of data loss

  • Computing layer: downstream of the access layer, processing is built on a streaming framework and must guarantee, in the strict sense, that no data is lost.

3.1.2 Log Architecture Optimization Directions

Data access layer:

  • Persist data first, process business later

  • Reduce logic complexity

Downstream forwarding layer:

  • Real-time stream class: strictly loss-free

  • High-timeliness class: guarantee timeliness first, tolerating possible partial loss

  • Resource isolation: physically isolate the deployments of different businesses so they cannot affect each other

  • Priority: identify and tag different classes of data according to business requirements


3.2 Architecture Decomposition

Based on this analysis of the log center's current state, and guided by the "only two loss points" theory, we decomposed and rebuilt the existing architecture.

3.2.1 Decomposing the Server Service (Reducing Data Loss at the Access Layer)

Guided by the no-duplication, no-loss theory above, the log access layer is built along the following lines to keep data from being duplicated or lost as far as possible.

  • Persist logs first: minimize access-layer data loss caused by server failures.

  • Break up the giant service: keep the access layer simple and lightweight, avoiding the stability risk of carrying too many business concerns;

  • Flexibility & ease of use: design a sensible streaming computing architecture around the characteristics of business requirements, without giving up the no-loss guarantee.

3.2.1.1 Log Persistence First

All fan-out data in the log center must be persisted first; this is the basic requirement of the log access layer. For real-time streams, data should be forwarded as promptly as possible while the forwarding delay for business data stays within the minute level. A sketch of this pattern follows the list.

  1. Persistence: the access layer persists data before any actual business processing, ensuring as far as possible that data is not lost.

  2. Real-time streams: avoid writing directly to message queues. Prefer writing to local disk and letting Minos forward to the message queue, so that data is not lost while the delay stays within minutes.
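Here is a minimal Java sketch of the "persist first, process later" pattern; the class and method names are illustrative, not the actual log server's API, and a production version would batch fsyncs and replay the write-ahead file on restart.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.concurrent.*;

/**
 * Sketch of a persist-first access layer: every batch is appended to
 * a local write-ahead file before the client is acknowledged, and
 * forwarding to the message queue happens asynchronously afterwards.
 */
public class PersistFirstReceiver {
    private final Path walFile;
    private final ExecutorService forwarder = Executors.newSingleThreadExecutor();

    public PersistFirstReceiver(Path walFile) {
        this.walFile = walFile;
    }

    /** Returns only after the batch is durably on disk. */
    public void receive(String logBatch) throws IOException {
        // 1. Persist: append + sync before acknowledging the client,
        //    so a server restart cannot lose an acknowledged batch.
        Files.write(walFile,
                (logBatch + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND,
                StandardOpenOption.SYNC);

        // 2. Process later: forwarding is asynchronous, so a queue
        //    hiccup delays delivery but cannot drop data -- the WAL
        //    can always be replayed (e.g. by a Minos-like agent).
        forwarder.submit(() -> forwardToMessageQueue(logBatch));
    }

    private void forwardToMessageQueue(String logBatch) {
        // Placeholder: hand the persisted batch to the queue producer.
    }
}
```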

3.2.1.2 Giant Service Decomposition & Function Sinking

To reduce the stability risk of piling ever more functionality onto the log service, while meeting downstream services' flexible subscription needs and keeping log fan-out sensible, we decomposed the online service further:

  • Real-time stream service: the message flow passes through access layer → fan-out layer → business layer.

    • Access layer: single-function; designed to lose as little data as possible by persisting data immediately;

    • Fan-out layer: gives downstream consumers flexible subscription, splitting & regrouping data (currently fanned out by tracking point ID; see the routing sketch below);

    • Business layer: subscribes to combinations of fan-out layer data, implements each business's own requirements, and is responsible for producing and forwarding data downstream;

  • High-timeliness business:

    • Real-time policy recommendation traffic is split out into a separate service that supports RPC-class data forwarding, guaranteeing ultra-high timeliness and a forwarding SLA of 99.95% or higher;

  • Other business:

    • Data monitoring, VIP, gray release, and similar businesses have lower timeliness and loss-rate requirements and are split out into separate services;

  • Technology selection: given the characteristics of streaming data computation, we chose the StreamCompute architecture to guarantee that, after data passes the access layer, the whole pipeline is "neither duplicated nor lost".

The logging service is therefore decomposed one step further along these layers.
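Before diving into the streaming design, here is a minimal Java sketch of the fan-out layer's split-flow rule: fan out by tracking point ID, giving high-traffic points dedicated queues and routing low-traffic points to a shared aggregation queue (as §3.2.1.3 details next). The class, the QPS threshold, and the queue-naming scheme are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of split-flow routing: high-traffic tracking points get a
 * dedicated message queue so downstream jobs can subscribe to exactly
 * what they need; low-QPS points share an aggregation queue to save
 * resources. Threshold and queue names are illustrative.
 */
public class SplitFlowRouter {
    private static final long DEDICATED_QUEUE_QPS_THRESHOLD = 10_000;

    // Rolling QPS estimates per tracking point ID, maintained elsewhere.
    private final Map<String, Long> qpsByDotId = new ConcurrentHashMap<>();

    /** Chooses the output queue for one record's tracking point ID. */
    public String routeQueue(String dotId) {
        long qps = qpsByDotId.getOrDefault(dotId, 0L);
        if (qps >= DEDICATED_QUEUE_QPS_THRESHOLD) {
            // High-traffic point: its own queue, flexible subscription.
            return "log_stream_" + dotId;
        }
        // Low-traffic point: shared aggregation queue.
        return "log_stream_aggregated";
    }
}
```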

3.2.1.3 Thoughts on Streaming Computing

To guarantee strict data-flow stability, we rely on the streaming computing architecture to ensure data is neither lost nor duplicated during business computation, while meeting the data consumption needs of different business scenarios. Given the log center's characteristics, we designed the streaming processing architecture as follows:

  • Server: fans the real-time stream out through the message queue into the streaming framework (the entry point of the split flow)

  • Split flow: routes different tracking points to different message queues according to traffic volume, which preserves flexible fan-out: downstream consumers can subscribe to exactly what they need;

    • Points whose QPS exceeds a threshold, horizontal points, etc.: each gets its own message queue for flexible fan-out;

    • Lower-QPS points: aggregated into a shared aggregation queue to save resources;

  • Business flow: a business with its own stream processing needs can deploy a separate job, so each job's resources are isolated;

    • Input: subscribes to combinations of the split flow's fan-out data for computation;

    • Output: after the mixed computation, data is written to the business message queue, which the business subscribes to and processes on its own;

  • Business filter: the final operator before data reaches the business layer, responsible for keeping each data flow at this layer duplicate-free end to end (covering system retries from both the server and the client). The log server generates a unique identifier (an MD5-like digest) for each log record, and the business flow's filter operator deduplicates globally on it; a sketch follows.
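A minimal, single-node Java sketch of that filter operator follows. The MD5-style identifier mirrors the description above; the bounded LRU "seen" set stands in for whatever distributed state the real streaming framework keeps, and all names and sizes are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of the business-flow filter: each record carries a unique
 * identifier stamped once at the log server; the filter drops any
 * record whose identifier was already seen, so client/system retries
 * do not create duplicates downstream.
 */
public class DedupFilter {
    private static final int MAX_TRACKED_IDS = 1_000_000;

    // Bounded LRU set of recently seen identifiers (single-node stand-in).
    private final Set<String> seen = Collections.newSetFromMap(
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                    return size() > MAX_TRACKED_IDS;
                }
            });

    /** Identifier assigned once at the log server, then carried along. */
    public static String logId(String deviceId, String dotId, long seq) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    (deviceId + "|" + dotId + "|" + seq)
                            .getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always present
        }
    }

    /** Returns true if the record is new and should be emitted. */
    public synchronized boolean accept(String logId) {
        return seen.add(logId);
    }
}
```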

3.2.2 Optimizing SDK Data Reporting (Addressing Client-Side Data Loss)

The client environment can cause data loss, especially under high concurrency, when data cannot be sent to the server immediately. The client therefore stores business data temporarily in a local database and sends messages to the server asynchronously, combining asynchronous sending with local persistence. But an APP can be killed or uninstalled at any moment, so the longer data sits locally, the less business value it retains and the more easily it is lost. We therefore optimized data reporting in the following directions (a sketch follows the list):

  • More reporting opportunities: improve the triggers for flushing cached data and minimize how long messages stay in the local cache; for example, tune the polling interval of the scheduled reporting task, piggyback cached data on business-triggered requests, and flush as soon as the number of cached messages reaches a threshold.

  • Better batch sizing: adjust the number of messages per report so that, within the limits on the size and count of reported data (thresholds found by experiment), messages reach the server as early as possible with the greatest overall benefit.
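A hedged Java sketch of these reporting triggers follows; the class name, threshold, and polling interval are illustrative choices, not the real SDK's experimentally tuned values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of client-side reporting: logs are persisted locally first,
 * then flushed when (a) a periodic timer fires or (b) the buffered
 * count reaches a threshold, keeping local dwell time short.
 */
public class LogReporter {
    private static final int FLUSH_THRESHOLD = 50;         // records
    private static final long POLL_INTERVAL_SECONDS = 10;  // timer trigger

    private final List<String> buffer = new ArrayList<>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    public LogReporter() {
        // Trigger 1: scheduled polling minimizes local cache time.
        timer.scheduleAtFixedRate(this::flush,
                POLL_INTERVAL_SECONDS, POLL_INTERVAL_SECONDS, TimeUnit.SECONDS);
    }

    public synchronized void track(String event) {
        persistLocally(event);   // survive crashes / process kills
        buffer.add(event);
        // Trigger 2: threshold reached -> report immediately.
        if (buffer.size() >= FLUSH_THRESHOLD) {
            flush();
        }
        // Trigger 3 (not shown): piggyback buffered logs on
        // business-initiated requests.
    }

    private synchronized void flush() {
        if (buffer.isEmpty()) return;
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        sendAsync(batch); // on success, purge the local store
    }

    private void persistLocally(String event) { /* local DB write */ }
    private void sendAsync(List<String> batch) { /* async HTTP upload */ }
}
```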

Continuous optimization of the client-side sending logic has also yielded large timeliness gains: the log convergence (arrival) rate rose by 2%+ on both platforms.

4. Outlook

The preceding sections described some of the log service's work on guaranteeing data accuracy. We will keep digging into the system's remaining risk points, for example:

  • Data loss caused by disk faults: the access layer will build a stricter no-loss foundation on top of the company's data persistence capabilities

We hope the log center keeps improving and helps the business use tracking data accurately.
