This article is compiled from the GOPS2017 Shanghai talk “Design and Practice of WeChat Mass Data Monitoring.”

About the author

Chen Xiaopeng joined Tencent in 2008 and moved to the WeChat operations and maintenance development team in 2012 to take charge of overhauling its monitoring system. He is the main designer and developer of WeChat’s current operations monitoring system.

Preface

This article shares the design and practice of WeChat’s operation and maintenance monitoring system. Before diving in, take a look at the current state of the WeChat background system in the figure below. Facing such a huge call volume and such complex call chains, manual maintenance alone is not feasible; we have to rely on a comprehensive, stable, and fast operation and maintenance monitoring system.

Our operation and maintenance monitoring system has three main functions:

  • The first is fault alerting;

  • The second is fault analysis and location;

  • The third is automated operations strategies.

Today’s sharing mainly includes three parts:

The first is lightweight monitoring data collection;

The second is the evolution of WeChat’s data monitoring;

The third is the design of data storage for massive monitoring and analysis.

1. Lightweight monitoring data collection

Let’s first look at a common data collection process: data is gathered from logs, aggregated locally, and then sent to a global aggregation server.

For WeChat, however, the enormous call volume generates monitoring data reports on the order of 200 billion per minute, and that may be a conservative estimate.

In the early days we used custom text logs for reporting, but with so many services and background modules the number of log formats grew rapidly and became hard to maintain. Log processing also put heavy pressure on CPU, network, storage, and the statistics pipeline, making it difficult to keep the monitoring system itself stable.

In order to achieve stable minute-level and even second-level data monitoring, we made a series of modifications.

Internally, our monitoring data processing is divided into two steps:

  • The first is data classification

  • The second is customizing the processing strategy

We classify the data; internally there are three types:

The first is real-time fault monitoring and analysis;

The second is non-real-time data statistics, such as business reports;

The third is single-user exception analysis, for example, analyzing a fault separately for a user who has reported a problem.

The following briefly introduces non-real-time data statistics and single-user exception analysis, and then focuses on real-time monitoring data processing.

1.1. Non-real-time data

For non-real-time data, we have a configuration management page.

Users apply for a logid plus user-defined data fields before reporting. Instead of writing log files, reports go through shared-memory queues and are packed in batches, which reduces disk I/O and the call pressure on the log servers. The statistics themselves are computed by a distributed system, which is standard practice nowadays.
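As a rough illustration of this reporting path, here is a minimal sketch of batched reporting. The record layout, batch size, and sender are assumptions, and an ordinary in-process buffer stands in for the real shared-memory queue shared between workers and the collector.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

struct StatRecord {
    uint32_t logid;                   // pre-registered log id
    std::vector<std::string> fields;  // user-defined data fields
};

class BatchReporter {
public:
    explicit BatchReporter(size_t batch_size) : batch_size_(batch_size) {}

    void Report(StatRecord rec) {
        std::lock_guard<std::mutex> lock(mu_);
        buffer_.push_back(std::move(rec));
        if (buffer_.size() >= batch_size_) Flush();
    }

private:
    // Pack the whole batch into one request so that no local log file is
    // written and per-record calls to the log servers are avoided.
    void Flush() {
        std::vector<StatRecord> batch;
        batch.swap(buffer_);
        SendToStatsService(batch);
    }

    // Stub: the real system would hand the packed batch to the distributed
    // statistics service.
    void SendToStatsService(const std::vector<StatRecord>& /*batch*/) {}

    std::mutex mu_;
    std::vector<StatRecord> buffer_;
    size_t batch_size_;
};
```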

1.2. Single-user exception analysis

For single-user exception analysis we only care about exceptions, so the reporting path is similar to the non-real-time path just described.

The format is fixed: logid plus fixed data fields (server IP, return code, and so on). The reporting volume is much larger than that of the non-real-time logs, so we report by sampling. Besides writing the data to TDW distributed storage, we also forward it to a separate cache to serve queries.
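A hash-based sampling rule is one simple way to realize “report by sampling” while keeping a sampled user’s records together; the helper below is only an illustrative assumption, not the actual WeChat implementation.

```cpp
#include <cstdint>
#include <functional>
#include <string>

struct UserExceptionRecord {
    uint32_t logid;
    std::string server_ip;
    int32_t ret_code;
};

// Keep 1 out of every `rate` users. Hashing the user id (rather than rolling
// a die per record) means a sampled user's records stay complete enough for
// per-user fault analysis.
bool ShouldSample(const std::string& user_id, uint32_t rate) {
    if (rate <= 1) return true;
    return std::hash<std::string>{}(user_id) % rate == 0;
}
```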

1.3. Real-time monitoring data

Real-time monitoring data is the most important part of this share, and it also accounts for the majority of the 200 billion logs reported per minute.

There are many kinds of real-time monitoring data, and they differ in format, source, and statistical method. To monitor them quickly and stably, we classify the data, simplify and unify the format of each class, and then apply the processing strategy best suited to the simplified data.

We group the data as follows:

  • Background data monitoring: monitoring data for WeChat background services;

  • Terminal data monitoring: beyond the background, we also care about the app’s performance on the phone, client-side exceptions, and network anomalies;

  • External monitoring service: we now host services provided by external developers such as merchants and mini programs. Both we and those developers need to know what anomalies exist between their services and WeChat, so we also provide an external monitoring service.

1.3.1 Background data monitoring

We divide background data monitoring into four categories by level, each with its own format and reporting method:

1. Hardware-level monitoring, such as server load, CPU, memory, I/O, and network traffic;

2. Process-level running status, such as the memory, CPU, and I/O usage of each process;

3. Inter-module call chains: call information between modules and machines, one of the key data sources for fault location;

4. Business indicators: monitoring data at the overall business level.

Data of the different categories is simplified into the following formats to make processing easier.

The bottom two layers use the IP + Key format; after containers were introduced, this became ContainerID + IP + Key.

For module call information, we extract module-level aggregates so that it can share the ID + Key format with business indicators.
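In code form, the two simplified formats might look like the structs below; the field names and widths are illustrative assumptions.

```cpp
#include <cstdint>

// Hardware- and process-level data: keyed by machine
// (extended to ContainerID + IP + Key once containers were introduced).
struct IpKeyReport {
    uint32_t container_id;  // 0 when not running in a container
    uint32_t ip;            // reporting machine
    uint32_t key;           // which metric: load, CPU, memory, ...
    uint32_t value;
};

// Module call information and business indicators: keyed by logical id.
struct IdKeyReport {
    uint32_t id;            // module or business id
    uint32_t key;           // which metric under that id
    uint32_t value;
};
```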

Let’s focus on IDKey data. In the early days, IDKey data was the core of our monitoring, and it accounted for more than 90% of all reports. As mentioned above, text-based reporting at that volume cannot be both stable and fast, so we built an extremely simplified, fast reporting path that aggregates directly in memory; the scheme is shown in the diagram below.

The data lives in two uint32_t[MAX_ID][MAX_KEY] tables; keeping two copies makes periodic collection convenient (one collection every 6 seconds).

Only three reporting operations are allowed: accumulate, set a new value, and set the maximum value. Each operates on a single uint32_t with negligible overhead, and the biggest advantage is that everything is aggregated in memory in real time: only about 1,000 records need to be extracted from memory per collection, which makes second-level statistics far easier.
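A minimal sketch of this scheme is shown below: two uint32_t tables, the three allowed operations, and a periodic collect-and-swap. The table sizes are illustrative, and plain process memory stands in for the shared memory used in production; the real implementation also has to handle concurrency details this sketch ignores.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>

constexpr int MAX_ID  = 1024;   // illustrative sizes
constexpr int MAX_KEY = 256;

// Two identical tables: writers update the "current" one while the collector
// drains the other, then the roles are swapped every collection cycle.
static uint32_t g_table[2][MAX_ID][MAX_KEY];
static std::atomic<int> g_current{0};

void IdKeyAdd(int id, int key, uint32_t delta) {
    g_table[g_current.load()][id][key] += delta;   // accumulate
}

void IdKeySet(int id, int key, uint32_t value) {
    g_table[g_current.load()][id][key] = value;    // set new value
}

void IdKeySetMax(int id, int key, uint32_t value) {
    uint32_t& slot = g_table[g_current.load()][id][key];
    slot = std::max(slot, value);                  // keep the maximum
}

// Called by the collector every 6 seconds: swap buffers, then read and clear
// the table the writers are no longer touching.
void CollectAndReset(void (*emit)(int id, int key, uint32_t value)) {
    int old = g_current.exchange(1 - g_current.load());
    for (int id = 0; id < MAX_ID; ++id)
        for (int key = 0; key < MAX_KEY; ++key)
            if (uint32_t v = g_table[old][id][key]) {
                emit(id, key, v);
                g_table[old][id][key] = 0;
            }
}
```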

Another important kind of background data is call-relation data, which plays a very important role in fault analysis and location.

The format above is used to locate fault points (machines, processes, interfaces) and their impact scope. Its reporting volume is second only to IDKey data, and since every background call generates one record, it is still impractical to handle through logging.

Inside the service we use another shared-memory statistics scheme similar to IDKey: if a service has N workers, each worker allocates two small shared-memory blocks for reporting, and a collection thread packs the data and sends it out.

This reporting is done by the framework layer, so service developers do not need to add reporting code by hand (99% of WeChat services use the internally developed service framework).
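The per-worker scheme could be sketched roughly as follows. The CallStat fields, slot count, and hashing are assumptions for illustration, plain arrays stand in for the two shared-memory blocks, and collision handling is omitted.

```cpp
#include <cstdint>

struct CallStat {
    uint32_t callee_module;   // which module/interface was called
    uint32_t callee_ip;
    uint32_t ret_code;
    uint32_t count;           // calls seen with this (module, ip, retcode)
    uint32_t total_cost_ms;   // accumulated latency
};

constexpr int SLOTS_PER_WORKER = 4096;

// Each worker owns two buffers; it writes one while the collection thread
// packs and sends the other, then the two are swapped (as with IDKey).
struct WorkerCallBuffer {
    CallStat slots[2][SLOTS_PER_WORKER];
    int active = 0;
};

// Called by the framework layer on every outgoing call, so service code
// never reports by hand.
void RecordCall(WorkerCallBuffer& buf, uint32_t module, uint32_t ip,
                uint32_t ret_code, uint32_t cost_ms) {
    uint32_t h = (module * 131u + ip * 31u + ret_code) % SLOTS_PER_WORKER;
    CallStat& s = buf.slots[buf.active][h];
    if (s.count == 0) {          // first hit on this slot: record the signature
        s.callee_module = module;
        s.callee_ip = ip;
        s.ret_code = ret_code;
    }
    s.count += 1;
    s.total_cost_ms += cost_ms;
}
```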

1.3.2 Terminal data monitoring

Having covered background data, let’s talk about terminal monitoring data. What we care about is the performance and exceptions of the WeChat app on the phone, the latency and failures of its calls to the WeChat background, and network anomalies.

The mobile terminal generates a very large amount of log data. Reporting it all would put heavy pressure on both the terminal and the background, so we do not report it in full.

We use different sampling configurations for different data types and terminal versions, and the background periodically delivers the sampling policies to the terminals.

The terminal does not sample and report in real time. Instead, it records data in temporary local storage and uploads it some time later, to minimize the impact on the device.
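Put together, the terminal side might look like the sketch below: the background pushes per-(data type, version) sample rates, unsampled events are dropped, and sampled ones are held locally until the next batch upload. The policy shape and flush interface are assumptions.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct SamplePolicy {
    // (data type, app version) -> keep 1 out of N events
    std::map<std::pair<uint32_t, std::string>, uint32_t> rate;
};

class TerminalReporter {
public:
    // The background periodically delivers a new policy.
    void UpdatePolicy(SamplePolicy p) { policy_ = std::move(p); }

    void OnEvent(uint32_t data_type, const std::string& version,
                 const std::string& payload, uint64_t seq) {
        auto it = policy_.rate.find({data_type, version});
        uint32_t n = (it == policy_.rate.end()) ? 100 : it->second;  // default rate
        if (n > 1 && seq % n != 0) return;   // drop unsampled events
        pending_.push_back(payload);         // store locally, do not send now
    }

    // Called on a timer or when the app is idle: upload everything in one batch.
    std::vector<std::string> TakeBatch() {
        std::vector<std::string> out;
        out.swap(pending_);
        return out;
    }

private:
    SamplePolicy policy_;
    std::vector<std::string> pending_;
};
```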

1.3.3 External monitoring services

Finally, a brief introduction to our newest external monitoring service. The design borrows from existing cloud monitoring products: users configure dimension information and monitoring rules themselves.

At present this function is available on our merchant management interface and in the mini program developer tool. Custom reporting is not open yet; only some fixed data items collected in the background are provided.

2. Development process of WeChat data monitoring

Having covered how data is reported, let’s look at how it is monitored.

2.1 Anomaly detection

There are three commonly used approaches to anomaly detection:

  • The first is a fixed threshold. Our curves differ greatly even between morning and night, and a single threshold cannot tell the difference, so it only fits a few scenarios;

  • The second is comparing with the same time on previous days. The problem is that our data is not the same at the same time every day; Monday through Saturday can differ a lot, so accuracy can only be maintained by lowering sensitivity;

  • The third is comparing with adjacent data points. In our data, adjacent points do not change smoothly, especially at small magnitudes, so again accuracy can only be maintained by lowering sensitivity.

These three common methods therefore do not fit our scenarios well, so we improved on them over time.

The first improved algorithm is based on the mean and standard deviation: take the data at the same time of day over the past month, compute the mean and standard deviation, and use the multi-day spread to absorb normal jitter.

This algorithm applies to a wide range of curves, but for curves with large fluctuations its sensitivity is low and anomalies are easily missed.
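The idea can be sketched in a few lines: compare the new point against the mean and standard deviation of the values seen at the same time of day over the past month. The sensitivity factor k is an illustrative assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// `same_time_history`: values at the same time of day over the past ~30 days.
bool IsAnomalous(const std::vector<double>& same_time_history,
                 double current, double k = 3.0) {
    const std::size_t n = same_time_history.size();
    if (n < 2) return false;

    double sum = 0.0;
    for (double v : same_time_history) sum += v;
    const double mean = sum / n;

    double sq = 0.0;
    for (double v : same_time_history) sq += (v - mean) * (v - mean);
    const double stddev = std::sqrt(sq / n);

    // Multi-day spread absorbs normal jitter; large deviations are flagged.
    return std::fabs(current - mean) > k * stddev;
}
```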

The second improved algorithm is polynomial-fit prediction, which suits smooth curves; think of it as an improved version of adjacent-point comparison.

However, if the anomaly is a steady rise or fall without any sudden change, it will still be judged as normal, so this method also under-reports.
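For comparison, here is a much simplified fit-and-predict check: a degree-1 least-squares fit over the most recent points predicts the next value, and the observation is flagged when it strays from the prediction by more than a tolerance. The real system fits a polynomial rather than a straight line; the degree and tolerance here are illustrative assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// `recent`: the last N per-minute values; `current`: the newest observation.
bool DeviatesFromTrend(const std::vector<double>& recent,
                       double current, double tolerance) {
    const std::size_t n = recent.size();
    if (n < 2) return false;

    // Closed-form least squares for y = a + b*x, with x = 0, 1, ..., n-1.
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx  += static_cast<double>(i);
        sy  += recent[i];
        sxx += static_cast<double>(i) * i;
        sxy += static_cast<double>(i) * recent[i];
    }
    const double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    const double a = (sy - b * sx) / n;

    const double predicted = a + b * n;    // extrapolate one step ahead
    return std::fabs(current - predicted) > tolerance;
}
```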

So although these two algorithms are a big improvement over the earlier methods, they still have defects. We are currently experimenting with other algorithms, and with combinations of them.

2.2 Monitoring configuration

Beyond the algorithms themselves, we also have problems configuring monitoring items. With so many services, more than 300,000 monitoring items would have to be configured by hand: for each one, a curve must be inspected and an algorithm and sensitivity chosen. That is not sustainable.

We are now trying to configure monitoring items automatically: using historical data and historical anomaly samples, extracting features, classifying the curves, and then automatically applying the best monitoring parameters for each class. We have had some success with this, but it is not yet perfect and is still being worked on.

3. Data storage design for massive monitoring and analysis

Having shared how data is collected and monitored, I will finally introduce how it is stored.

Data storage also matters a great deal to us. As mentioned, monitoring data is produced every minute and kept for about a month, and fault analysis is demanding: analyzing one module’s anomaly requires loading the call information, CPU, memory, network, and various process metrics of all its machines. If the module has many machines, a single read can cover more than 500,000 (50w) keys across 2 days of data.

Therefore, we have very high requirements for read and write performance of monitoring data stores.

For writes, the basic requirement is that the total input volume can exceed 200 million records per minute, and a single machine must sustain at least 5 million (500w) writes per minute. For reads, the store must support pulling 500,000 (50w) records per minute across roughly 2 days of data.

In terms of data structure, each kind of data has multiple dimensions. Call-relation data, for example, has many dimensions, and we need partial-match queries across them, such as by client side, server side, module level, or host level, rather than simple key-value lookups.

Note that our multi-dimensional key is split into a main key and sub keys; we will explain why later.

When redesigning the monitoring data storage, we looked at existing open-source solutions, but at the time we found no ready-made system that met both the performance and the data-structure requirements, so we built our own time-series storage server.

First, the write path. Writing one record per key per minute would produce far too much data, so we first cache incoming data for a period of time and then batch-merge it into one record per key per day; this is a common way to improve write performance. Our cache window is one hour.
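A rough sketch of that merge step, assuming a per-day record with 1440 per-minute slots (the layout and names are illustrative; the one-hour cache window is from the text):

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct DayRecord {
    std::array<uint32_t, 1440> minutes{};  // one slot per minute of the day
};

struct BufferedPoint {
    std::string key;          // serialized multi-dimensional key
    uint32_t minute_of_day;   // 0..1439
    uint32_t value;
};

// Called roughly once an hour: fold the buffered points into day records,
// turning many tiny writes into one larger write per key per day.
void MergeHourlyBuffer(std::vector<BufferedPoint>& buffer,
                       std::map<std::string, DayRecord>& day_table) {
    for (const BufferedPoint& p : buffer)
        day_table[p.key].minutes[p.minute_of_day] = p.value;
    buffer.clear();
}
```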

The key point of our self-developed key-value store is how keys are handled. First, all keys are resident in memory. Second, the data volume is far too large for one machine, so we use a multi-machine cluster and route writes and queries by hash(main_key).

Partial-match queries are implemented with a modified binary search. The query performance achieved this way is very high, exceeding 1,000,000 (100w) lookups per second, and with a query result cache it is higher still.
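One plausible way to picture this: keep the serialized keys sorted with the main key first and the sub keys after it, route by hash(main_key), and answer a partial-match query as a prefix range located by binary search. The key encoding and helpers below are illustrative assumptions, not the actual modified algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Keys are serialized as "main_key|sub_key1|sub_key2|..." and kept sorted
// in memory on each storage node.
std::vector<std::string> PrefixMatch(const std::vector<std::string>& sorted_keys,
                                     const std::string& prefix) {
    std::vector<std::string> out;
    auto it = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), prefix);
    for (; it != sorted_keys.end(); ++it) {
        if (it->compare(0, prefix.size(), prefix) != 0) break;  // left the range
        out.push_back(*it);
    }
    return out;
}

// The main key also decides which node of the cluster owns the record.
std::size_t NodeFor(const std::string& main_key, std::size_t node_count) {
    return std::hash<std::string>{}(main_key) % node_count;
}
```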

However, this design also has problems: hash(main_key) distributes data unevenly, and with one record per key per day, the keys occupy too much memory.

Because of the above problems, we made a second improvement.

The second improvement splits key-value into key-id-value, with an ID allocation service controlling where value data lands so that it stays balanced. Key-to-ID assignments are redone every seven days to reduce memory usage.
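A toy version of that split might look like this; the allocator is only an illustrative stand-in for the real ID allocation service, with the seven-day reallocation taken from the text.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

class IdAllocator {
public:
    // Hand out ids round-robin across storage nodes so value data stays
    // balanced no matter how the main keys hash.
    uint64_t IdFor(const std::string& key, uint64_t node_count) {
        auto it = ids_.find(key);
        if (it != ids_.end()) return it->second;
        const uint64_t seq = next_++;
        const uint64_t node = seq % node_count;
        const uint64_t id = (node << 32) | (seq & 0xFFFFFFFFu);
        ids_[key] = id;
        return id;
    }

    // Run every 7 days: drop stale assignments so inactive keys stop holding
    // ids (and memory), and active keys get fresh, compact ids.
    void Reallocate() {
        ids_.clear();
        next_ = 0;
    }

private:
    std::unordered_map<std::string, uint64_t> ids_;
    uint64_t next_ = 0;
};
```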

The biggest remaining problem for the storage is disaster recovery (DR). Since this system is what monitors the servers, the requirements on its own DR capability are also very high.

Generally speaking, it is difficult to achieve both strong disaster recovery and strong data consistency. However, the WeChat background team has open-sourced its self-developed PhxPaxos protocol framework, which lets us achieve data disaster recovery easily.

In addition, the multi-master feature of the PhxPaxos framework improves concurrent read performance.
