Operations monitoring at the BATJ level: how to get there

1. Introduction

We know that the goal of a monitoring system is to help us understand the running state of business systems more comprehensively and in more detail, and to discover risks sooner. That protects the business SLA and gives operations engineers both more time to defuse risks and a direction for troubleshooting. Some teams use open-source monitoring tools (e.g. Nagios, Zabbix, Prometheus, Grafana), while others build their own (e.g. Xiaomi’s Open-Falcon, or Tencent’s internal systems tnM2 [basic monitoring] and CMS [log monitoring]).

As business systems come to depend on monitoring, we demand more of the monitoring system itself: high availability, scalability, and so on. How can we review and improve the capabilities of our own monitoring system comprehensively and systematically?

2. Ways to improve capability

There are many ways to look at a monitoring system’s requirements comprehensively and systematically. When problems come up, we can compare our system with the excellent monitoring systems of top companies and find points to improve. We can also compare it against the “monitoring and management” capability assessment in the “DevOps Capability Maturity Model”, a standard led by the China Academy of Information and Communication Technology with the participation of BATJ and other Internet giants. From the standard’s assessment content we can see how BATJ define the capabilities of an advanced monitoring system. Let’s take a look:

3. Improvement points: capability items

The “monitoring and management” capability is divided into three capability items: “monitoring and collection”, “data management”, and “data application”. Each capability item contains its own sub-items. I have picked the points I consider most representative for analysis:

A) [Capability item 1: Monitoring and collection]

1. Capability point: “Support open, customizable schemes for collecting and reporting data content”

Question: Why is there a requirement for reporting?

Interpretation: Take CMS, Tencent’s internal log monitoring system, as an example. It offers several collection schemes (Agent, SDK, Kafka, ES, etc.), and each scheme suits a different scenario.

Agent: Similar to Filebeat. Given a path on the server, it watches the file’s inode and reports new data as soon as it is written.

SDK: The reporting code is embedded in the business logic. This suits scenarios where sensitive data must be reported but must not be written to disk: the business logic can desensitize (mask) the sensitive data first and then report it. It also suits scenarios where the log volume is so large that the business does not want to pay the performance cost of writing log files as an intermediate step. For example, in a financial transaction scenario, transaction data should be monitored, but some sensitive data must not enter the monitoring system. In that case the SDK desensitizes the data when the log is generated, hides the user information, and only then reports to the monitoring system.

Kafka: This suits scenarios where one log has multiple consumers. Once the business has written its logs to Kafka, each consumer can fetch them on its own. Take the financial transaction scenario again: the same log can feed both security audit and monitoring, so the security audit system and the monitoring system can each pull from the same Kafka topic and the business does not have to write multiple copies.
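The SDK approach can be sketched roughly as follows. This is a minimal illustration, not the CMS SDK: the field names in `SENSITIVE_FIELDS`, the card-number pattern, and the `send` callback are all assumptions made for the example.

```python
import json
import re

# Hypothetical field names; a real SDK would let the business configure these.
SENSITIVE_FIELDS = {"user_name", "id_number"}
# Assumed card-number shape: 16 to 19 digits.
CARD_RE = re.compile(r"\b(\d{4})\d{8,11}(\d{4})\b")


def desensitize(record: dict) -> dict:
    """Return a copy of the record that is safe to report."""
    clean = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            clean[key] = "***"  # hide user information entirely
        elif isinstance(value, str):
            # Keep only the first and last four digits of card-like numbers.
            clean[key] = CARD_RE.sub(r"\1********\2", value)
        else:
            clean[key] = value
    return clean


def report(record: dict, send) -> None:
    """Mask the record in-process, then hand it to the transport.

    `send` stands in for whatever actually ships data to the monitoring
    system; nothing is written to a local log file on the way.
    """
    send(json.dumps(desensitize(record), sort_keys=True))
```

Because masking happens inside the business process, the raw values never reach a log file or the monitoring system.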
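The idea of several systems reading the same log can be illustrated with a small in-memory stand-in for a topic. This is not the Kafka client API; the `Topic` class and the group names are hypothetical and only mimic the fact that each consumer group keeps its own read offset.

```python
from collections import defaultdict


class Topic:
    """In-memory stand-in for a single-partition topic (illustration only)."""

    def __init__(self):
        self._log = []                    # append-only message log
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def produce(self, message: str) -> None:
        """The business writes each log line exactly once."""
        self._log.append(message)

    def poll(self, group: str, max_records: int = 100) -> list:
        """Return unread messages for this group and advance only its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


topic = Topic()
topic.produce("order=1001 status=paid")
topic.produce("order=1002 status=failed")

audit_batch = topic.poll("security-audit")
monitor_batch = topic.poll("monitoring")
```

Both groups receive the full log because reading by one group does not move the other group’s offset.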

2. Capability point: “Support multiple transmission schemes, such as push and pull data”

Question: Why are both push and pull needed? Isn’t one of them enough?

Interpretation: Monitoring systems normally pull data, because the server initiates the request and so the timing and the process are under its control. Why, then, is push needed as well? Because several scenarios require it:

Network restrictions: Security policy may stipulate, for example, that a zone with a high security level can open connections to a zone with a low security level but not the other way round. Data then has to be pushed from the high-security zone to the monitoring service.

Performance requirements: As with Zabbix’s active and passive modes, having the agent initiate the transfer takes polling load off the server.

Service characteristics: Some services expose no external request interface, so their internal logic has to actively push monitoring data out.

To monitor business systems and processes comprehensively, we need to support all of these.

For example, a scheduled task in a service collects offline data and writes it to the database. The task has no request interface, so how can we monitor its running state? We can add a heartbeat to the task logic that periodically pushes its status to the monitoring system. Push is therefore just as essential a transmission capability for monitoring as pull.
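A heartbeat of this kind might look like the sketch below. The payload fields, the `push` callback, and the three-missed-intervals rule in `is_stale` are assumptions chosen for illustration, not part of any particular monitoring system.

```python
import json
import time


def run_with_heartbeat(task, push, job_name: str, clock=time.time) -> None:
    """Run one round of a scheduled task, then push its status.

    `push` stands in for the call that sends data to the monitoring system.
    """
    started = clock()
    status = "ok"
    try:
        task()
    except Exception as exc:
        # Report the failure instead of letting the job die silently.
        status = f"error: {exc}"
    push(json.dumps({
        "job": job_name,
        "status": status,
        "timestamp": started,
        "duration_s": round(clock() - started, 3),
    }))


def is_stale(last_heartbeat: float, now: float, interval_s: float, missed: int = 3) -> bool:
    """Monitoring side: treat the job as down after `missed` silent intervals."""
    return now - last_heartbeat > interval_s * missed
```

The monitoring side needs both signals: an explicit error status catches a task that ran and failed, while `is_stale` catches a task that stopped running altogether.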

B) [Capability item 2: Data management]

1. Capability point: “Ability to apply rule-based processing to raw data”

Question: Why must rule-based processing happen when data is received, rather than after it has been stored?

Interpretation: This capability is needed at receive time for reasons of performance and data integrity. Take Tencent’s log monitoring system again. When it receives large volumes of logs reported by Agents, those logs do not necessarily follow our rules. A single log format error can make a large amount of incoming data abnormal and can also contaminate stored data, so a rule-based processing step is needed to clean out data that does not conform. Moreover, if abnormal logs arrive in large numbers, cleaning them after they have been stored consumes a lot of computing power and puts heavy pressure on later stages. That is why this capability is necessary.
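Cleaning at receive time can be as simple as validating each line against the agreed format before anything is stored. The sketch below assumes a hypothetical format of `<timestamp> <LEVEL> <service> <message>`; the real rules would be whatever the business has agreed with the monitoring system.

```python
import re

# Hypothetical log format: "<ISO timestamp> <LEVEL> <service> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) "
    r"(?P<level>DEBUG|INFO|WARN|ERROR) "
    r"(?P<service>[\w.-]+) "
    r"(?P<message>.+)$"
)


def clean(lines):
    """Split incoming lines into parsed records and rejected raw lines."""
    accepted, rejected = [], []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            accepted.append(match.groupdict())
        else:
            rejected.append(line)  # kept aside so bad input can be inspected
    return accepted, rejected
```

Only `accepted` moves on to aggregation and storage, so malformed lines never reach the later, more expensive stages.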

2. Capability point: “Ability to perform association analysis and processing across heterogeneous data sources”

Question: What is the specific capability of association analysis of heterogeneous data sources?

Interpretation: Heterogeneous data sources broadly means multiple data sources that differ in data structure, access method, or form. Take CMS, Tencent’s in-house log monitoring system, again. When a service reports logs containing the source IP address and key business data, simple deduplication and sorting tell us which source IP addresses make the most requests. But if we want the breakdown by city, province, or even operator (China Telecom, China Mobile, China Unicom), we need association analysis. There is a dataset that maps IP addresses to city, province, and operator (it has to be maintained separately because it changes constantly). By associating this dataset with the log data, we get the result we want.
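The association step is essentially a join between log records and the IP mapping table. In the sketch below the table contents, the `src_ip` field name, and the region assignments are made up for illustration (the networks are reserved documentation ranges); a real table would be far larger and would need an indexed lookup rather than a linear scan.

```python
import ipaddress
from collections import Counter

# Hypothetical mapping table: network -> (province, city, operator).
IP_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), ("Guangdong", "Shenzhen", "China Telecom")),
    (ipaddress.ip_network("198.51.100.0/24"), ("Beijing", "Beijing", "China Mobile")),
]


def locate(ip: str):
    """Look up the province, city, and operator for one IP address."""
    addr = ipaddress.ip_address(ip)
    for network, info in IP_TABLE:
        if addr in network:
            return info
    return ("unknown", "unknown", "unknown")


def requests_by_operator(log_records):
    """Join each log record with the IP table and count requests per operator."""
    counts = Counter()
    for record in log_records:
        _province, _city, operator = locate(record["src_ip"])
        counts[operator] += 1
    return counts
```

Grouping by province or city works the same way; only the element taken from the lookup result changes.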

3. Capability point: “Management features such as data consistency, integrity and availability”

Question: Data consistency, integrity, and availability are easy to understand, but what are the management features?

Interpretation: Take CMS, Tencent’s in-house log monitoring system, again. A log monitoring pipeline has many stages: user data reporting, data formatting, processing, aggregation (statistics, dimension analysis), storage/delivery, and writing to the time-series database. When the final result the user sees is abnormal, how can we quickly tell which stage went wrong? This is what the management features are for: add self-monitoring to every stage so that the data flow and its charts are clearly visible and the abnormal point can be found quickly.

C) [Capability item 3: Data application]

1. Capability point: “Alarm storm management and control, such as suppression and convergence”
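One way to picture per-stage self-monitoring is to count what enters and leaves every stage, so that the first stage losing data stands out. The stage names and the `Pipeline` class below are simplified assumptions, not the structure of CMS.

```python
from collections import Counter

# Simplified, hypothetical stage list for illustration.
STAGES = ["receive", "format", "aggregate", "store"]


class Pipeline:
    def __init__(self, handlers):
        self.handlers = handlers  # stage name -> function(record) -> record or None
        self.stats = Counter()    # self-monitoring counters for every stage

    def process(self, record):
        for stage in STAGES:
            self.stats[f"{stage}.in"] += 1
            try:
                record = self.handlers[stage](record)
            except Exception:
                self.stats[f"{stage}.error"] += 1
                return None
            if record is None:
                self.stats[f"{stage}.dropped"] += 1
                return None
            self.stats[f"{stage}.out"] += 1
        return record


def first_lossy_stage(stats):
    """Return the first stage whose output count is below its input count."""
    for stage in STAGES:
        if stats[f"{stage}.out"] < stats[f"{stage}.in"]:
            return stage
    return None
```

Charting the `in`, `out`, `error`, and `dropped` counters per stage gives the data-flow view described above: when the final result looks wrong, the stage where the counts diverge is the place to start.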

Question: What are the common methods of alarm convergence?

Interpretation: Common alarm convergence rules include time-based convergence, event-based convergence, and severity-based convergence, and different modes can be chosen to fit different service requirements. Time-based convergence is the most common and is part of the basic configuration in Nagios and Zabbix. Event-based convergence usually relies on caller/callee relationships between services, as in Zabbix’s trigger dependencies feature. Severity-based convergence is used in both open-source and proprietary systems.

4. Conclusion

Whether we study open-source monitoring systems or the DevOps (integrated R&D and operations) Capability Maturity Model, comparison is a good way to assess and improve a monitoring system’s capabilities. The knowledge points in the standard are, of course, the distilled experience of many experts; this article has picked out only a few of them.
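Time-based and event-based convergence can be combined in a few lines. The sketch below is an assumption-laden illustration rather than how Nagios or Zabbix implement it: `AlarmConverger`, its window rule, and the alarm keys are hypothetical, and severity-based convergence is left out.

```python
class AlarmConverger:
    def __init__(self, window_s: float, depends_on=None):
        self.window_s = window_s
        self.depends_on = depends_on or {}  # alarm key -> upstream alarm key
        self.last_sent = {}                 # alarm key -> time of last notification
        self.firing = set()                 # alarm keys currently firing

    def should_notify(self, key: str, now: float) -> bool:
        self.firing.add(key)
        # Event-based: stay quiet while the upstream dependency is firing.
        if self.depends_on.get(key) in self.firing:
            return False
        # Time-based: at most one notification per key per window.
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window_s:
            return False
        self.last_sent[key] = now
        return True

    def resolve(self, key: str) -> None:
        """Mark an alarm as recovered so dependents are no longer suppressed."""
        self.firing.discard(key)
```

With a dependency such as `{"api.latency": "db.down"}`, a database outage produces one notification per window for `db.down` and none for the latency alarm it causes, which is the kind of suppression that keeps an alarm storm manageable.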

The “Tencent Cloud Monitoring” public account will continue to publish best practices and related articles on monitoring. You are welcome to follow it.
