I. Business background

In the era of information explosion, information flows freely around the world via the Internet, giving rise to a wide variety of platforms and software systems. As the business grows, so does system complexity.

When a problem in a core service affects the user experience, developers often fail to discover it until it is too late. Likewise, when server CPU usage keeps climbing or disk space fills up, O&M personnel need to find and handle the problem promptly. Both cases call for an effective monitoring system that can detect such problems and raise alarms.

How to monitor and maintain these services and servers is a concern that neither developers nor O&M personnel can ignore. In this article of more than 5,000 words, I will systematically walk through the principles behind vivo server monitoring and the evolution of its architecture, as a reference for monitoring technology selection.

Vivo server monitoring aims to provide one-stop monitoring for server applications, covering system monitoring, JVM monitoring, and custom business metrics, together with real-time, multi-dimensional, multi-channel alarm services. It helps users keep track of application state, detect faults early through timely alarms, and provides detailed data for tracing and locating problems, thereby improving service availability. At present, more than 200 business teams have onboarded vivo server monitoring. This article focuses on server monitoring; our company also operates other excellent monitoring systems, including general-purpose monitoring, call-chain monitoring, and client monitoring.

1.1 Basic workflow of a monitoring system

Whether a monitoring system is open source or self-developed, the overall workflow is much the same.

1) Data collection: JVM monitoring data such as GC count, thread count, and the sizes of the old and young generations; system monitoring data such as disk usage, disk read/write throughput, inbound and outbound network traffic, and TCP connection count; business monitoring data such as error logs, access logs, video playback volume, PV, UV, and so on.

2) Data transmission: the collected data is reported to the monitoring system as messages or over HTTP.

3) Data storage: some systems store data in an RDBMS such as MySQL or Oracle, some in a time-series database such as OpenTSDB or InfluxDB, and some directly in HBase.

4) Data visualization: data metrics are displayed graphically, as line charts, bar charts, pie charts, and so on.

5) Alarm monitoring: flexible alarm settings, with support for notification channels such as email, SMS, and IM.

1.2 How to use a monitoring system properly

Before using a monitoring system, we need to understand the basic working principles of the monitored objects. For JVM monitoring, for example, we need to know the JVM memory structure and the common garbage collection mechanisms. Next, we need to decide how to describe and define the state of the monitored object: to monitor the interface performance of a service function, for instance, we can monitor the request volume, latency, and error count of the interface. Once that is settled, we need to define proper alarm thresholds and alarm types, so that an alarm notification helps us discover faults in a timely manner. Finally, we should establish a sound fault-handling process: respond quickly when an alarm arrives and deal with online faults in time.

II. Architecture and evolution of the vivo server monitoring system

Before introducing the architecture of the vivo server monitoring system, let us first look at the OpenTSDB time-series database and explain why we chose OpenTSDB. The reasons are as follows:

1) Each collected monitoring metric has a single value at any given point in time, with no complex structure or relationships.

2) Monitoring metrics are characterized by continuous change over time.

3) OpenTSDB is a distributed, scalable time-series database built on HBase, so the storage layer needs little extra work and inherits HBase's high throughput and good scalability.

4) It is open source, implemented in Java, and provides an HTTP-based API, so we can quickly modify it when troubleshooting problems.

2.1 OpenTSDB overview

1) OpenTSDB is a distributed, scalable time-series database built on HBase, used mainly for monitoring: for example, collecting and storing monitoring data from large clusters (network devices, operating systems, and applications). It supports second-level data collection and permanent storage, facilitates capacity planning, and integrates easily with existing monitoring systems. The OpenTSDB system architecture is as follows:

(From official documentation)

The basic storage unit is the Data Point, i.e., the value of a Metric at a certain point in time. A Data Point consists of the following parts (see the sketch after this list):

  • Metric: the name of the monitoring metric;

  • Tags: key-value pairs (TagKey and TagValue) attached to the Metric, used to mark information such as the machine name;

  • Value: the actual value of the Metric, an integer or a decimal;

  • Timestamp: the timestamp.
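To make the structure concrete, here is a minimal Java sketch of a data point as just described. This is an illustrative model only, not OpenTSDB's actual classes:

import java.util.Map;

// Illustrative model of an OpenTSDB data point; not OpenTSDB's actual classes.
public class DataPoint {
    final String metric;             // metric name, e.g. "sys.cpu.user"
    final Map<String, String> tags;  // TagKey -> TagValue pairs, e.g. host=web01
    final double value;              // the actual value, integer or decimal
    final long timestamp;            // Unix timestamp in seconds

    DataPoint(String metric, Map<String, String> tags, double value, long timestamp) {
        this.metric = metric;
        this.tags = tags;
        this.value = value;
        this.timestamp = timestamp;
    }

    public static void main(String[] args) {
        new DataPoint("sys.cpu.user", Map.of("host", "web01"), 42.0, 1700000000L);
    }
}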

OpenTSDB's core storage consists of two tables: tsdb and tsdb-uid. The tsdb table stores the monitoring data, as shown below:

(Photo credit: www.jianshu.com)

The row key is Metric + hour-aligned Timestamp + TagKey + TagValue, with each component replaced by its byte mapping (UID) and the results concatenated. Within column family t, the qualifier is the offset in seconds from the row's hour boundary, and the cell value is the Value.

The tsdb-uid table stores the byte mappings just mentioned, as shown below:

(Photo credit: www.jianshu.com)

The UID 001 in the figure maps to tagk = host or tagv = static; the table supports lookups in both directions (name to UID and UID to name).
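As a rough illustration of the row-key layout described above, the following sketch assembles a row key from 3-byte UIDs (OpenTSDB's default UID width) and an hour-aligned base timestamp. The UID values here are hypothetical; real UIDs come from the tsdb-uid table:

import java.nio.ByteBuffer;

// Sketch of the tsdb row-key layout: metric UID + hour-aligned timestamp + tag UIDs.
// Assumes 3-byte UIDs (OpenTSDB's default); the UID values below are made up.
public class RowKeySketch {
    static byte[] rowKey(byte[] metricUid, long tsSeconds, byte[] tagkUid, byte[] tagvUid) {
        int baseHour = (int) (tsSeconds - tsSeconds % 3600); // align to the hour boundary
        ByteBuffer buf = ByteBuffer.allocate(metricUid.length + 4 + tagkUid.length + tagvUid.length);
        buf.put(metricUid).putInt(baseHour).put(tagkUid).put(tagvUid);
        return buf.array(); // qualifier = seconds offset from baseHour; cell value = the data point
    }

    public static void main(String[] args) {
        byte[] key = rowKey(new byte[]{0, 0, 1}, 1700000000L, new byte[]{0, 0, 1}, new byte[]{0, 0, 2});
        System.out.println(key.length); // 13 bytes: 3 + 4 + 3 + 3
    }
}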

2) Notes on how we use OpenTSDB:

  • Instead of going through the REST interface provided by OpenTSDB, our client connects directly to HBase;

  • The thread that performs OpenTSDB's compact action is disabled on the project side;

  • Buffered data is fetched from Redis every 10 seconds and written to OpenTSDB in batches (see the sketch below).
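The 10-second batched drain can be pictured as follows. This is a minimal illustration using the Jedis client, with a hypothetical queue key, and a placeholder standing in for the direct HBase write mentioned above:

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import redis.clients.jedis.Jedis;

public class BatchDrainSketch {
    private static final String QUEUE_KEY = "vmonitor:buffer"; // hypothetical key name

    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            // Read up to 1000 buffered points, write them out, then trim them off the list.
            List<String> batch = jedis.lrange(QUEUE_KEY, 0, 999);
            if (!batch.isEmpty()) {
                writeToOpenTsdb(batch); // placeholder for the direct HBase write
                jedis.ltrim(QUEUE_KEY, batch.size(), -1);
            }
        }, 10, 10, TimeUnit.SECONDS);
    }

    private static void writeToOpenTsdb(List<String> batch) { /* batched write */ }
}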

2.2 Pitfalls of OpenTSDB in practice

1) Precision

String value = "0.51";
float f = Float.parseFloat(value);
// Bytes and UniqueId below are OpenTSDB helper classes.
int raw = Float.floatToRawIntBits(f);
byte[] float_bytes = Bytes.fromInt(raw);
int raw_back = Bytes.getInt(float_bytes, 0);
double decode = Float.intBitsToFloat(raw_back); // widened to double on decode
System.out.println("Parsed Float: " + f);
System.out.println("Encode Raw: " + raw);
System.out.println("Encode Bytes: " + UniqueId.uidToString(float_bytes));
System.out.println("Decode Raw: " + raw_back);
System.out.println("Decoded Float: " + decode);
/* Output:
 * Parsed Float: 0.51
 * Encode Raw: 1057132380
 * Encode Bytes: 3F028F5C
 * Decode Raw: 1057132380
 * Decoded Float: 0.5099999904632568
 */

As the code above shows, OpenTSDB cannot know the caller's intent when storing floating-point data and runs into precision problems in the conversion: "0.51" goes in and "0.5099999904632568" comes back.
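A common workaround for this kind of precision loss (not necessarily the approach vivo adopted) is to scale decimals to integers before storage and scale back on read; a minimal sketch:

public class PrecisionWorkaround {
    public static void main(String[] args) {
        // Scale to an integer before storage; divide by the same factor on read.
        long scaled = Math.round(Double.parseDouble("0.51") * 1000); // 510, stored losslessly
        double restored = scaled / 1000.0;
        System.out.println(restored); // prints 0.51
    }
}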

2) Aggregation function pitfalls

Most of OpenTSDB's aggregation functions, including sum, avg, max, and min, perform LERP (linear interpolation): missing values in a series are filled in by interpolation, which is very unfriendly when you need gaps to remain empty. See the OpenTSDB documentation on interpolation for details.

The OpenTSDB instance currently used by vivo server monitoring (VMonitor) runs our modified source code: we added a nimavg aggregation function, which together with the built-in zimsum function meets the requirement of leaving null values unfilled.
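To see why interpolation matters, consider two series that report at different timestamps. The toy sketch below contrasts a LERP-filled sum (the behavior of sum/avg/max/min) with a no-fill sum in the spirit of zimsum; the numbers are made up:

public class InterpolationSketch {
    // Linear interpolation, as OpenTSDB's sum/avg/max/min aggregators apply it.
    static double lerp(long t, long t0, double v0, long t1, double v1) {
        return v0 + (v1 - v0) * (t - t0) / (double) (t1 - t0);
    }

    public static void main(String[] args) {
        // Series A reports 10.0 at t=0 and 20.0 at t=60; series B reports 5.0 at t=30.
        double aFilled = lerp(30, 0, 10.0, 60, 20.0);                    // 15.0, invented by LERP
        System.out.println("sum with LERP at t=30: " + (aFilled + 5.0)); // 20.0
        System.out.println("no-fill sum at t=30:   " + 5.0);             // only B actually reported
    }
}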

2.3 How the vivo server monitoring collector works

1) Timer

There are three collectors: the OS collector, the JVM collector, and the business metric collector. The OS and JVM collectors collect and aggregate data once a minute; the business metric collector collects data in real time and aggregates and resets its data every minute.
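The one-minute collect-and-reset cycle can be pictured as a scheduled task draining an atomic counter. This is a minimal sketch with hypothetical names, not vivo's actual collector code:

import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class MinuteAggregatorSketch {
    private static final LongAdder counter = new LongAdder(); // incremented in real time

    public static void main(String[] args) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            long lastMinute = counter.sumThenReset(); // snapshot and reset once a minute
            report(lastMinute);
        }, 1, 1, TimeUnit.MINUTES);
    }

    private static void report(long value) { /* hand off to the reporting pipeline */ }
}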

2) Business metric collector

Business metrics are collected in two ways: log output filtering and tool-class code reporting (intrusive). Log output filtering extends the log4j Filter to obtain the renderedMessage emitted by the Appender specified in the metric configuration, and listens and collects synchronously according to the configured keywords and aggregation mode. Code reporting sends message information tagged with the metric code specified in the code; this is an intrusive collection method implemented by calling the Util class provided by the monitoring SDK. Business metric configurations are refreshed from the CDN every five minutes. Various built-in aggregators perform the aggregation, including count, sum, average, max, and min.
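As an illustration of the log output filtering approach, here is a minimal log4j 1.x Filter that counts lines containing a configured keyword. The keyword and counter handling are hypothetical, not vivo's actual SDK:

import java.util.concurrent.atomic.LongAdder;

import org.apache.log4j.spi.Filter;
import org.apache.log4j.spi.LoggingEvent;

// Counts log lines containing a configured keyword without affecting log output.
public class KeywordCountingFilter extends Filter {
    private final String keyword = "ORDER_PAID";  // would come from the metric configuration
    private final LongAdder count = new LongAdder();

    @Override
    public int decide(LoggingEvent event) {
        String msg = event.getRenderedMessage();  // the rendered log line
        if (msg != null && msg.contains(keyword)) {
            count.increment();                    // count aggregation; sum/avg parse a column instead
        }
        return Filter.NEUTRAL;                    // never changes whether the line is logged
    }
}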

2.4 Old architecture of vivo server monitoring

1) Data collection and reporting: the vmonitor-agent collects data according to the metric configuration and reports it to RabbitMQ once a minute. The metric configuration is downloaded from the CDN and refreshed every 5 minutes; the CDN content is uploaded by the monitoring backend.

2) Computation and storage: the monitoring backend consumes the RabbitMQ data, unpacks it, and stores it in OpenTSDB for the visual charts to query. The configuration of monitoring items, applications, metrics, and alarms is stored in MySQL. A distributed task distribution module built on Zookeeper and Redis coordinates multiple monitoring service instances for distributed computation.

3) Alarm detection: monitoring metric data is fetched from OpenTSDB, anomalies are detected according to the alarm configuration, and notifications are sent via in-house messages and SMS. Alarm detection also performs its distributed computation through the distributed task distribution module.

2.5 Old deployment architecture of vivo server monitoring

1) Self-built data center A: in China, the monitoring service is deployed in self-built data center A and listens to the RabbitMQ messages in that data center. The Redis, OpenTSDB, MySQL, and Zookeeper instances it depends on are all in the same data center. Monitoring metric configurations are uploaded by the file service to the CDN, from which the monitored applications fetch them.

2) Cloud data center: applications in the cloud data center report monitoring data to the local RabbitMQ, which forwards the designated queues to the RabbitMQ in self-built data center A. Monitoring configurations in the cloud data center are pulled through the CDN.

2.6 New architecture of vivo server monitoring

1) Collection (access side): the business side integrates the vmonitor-collector and configures monitoring items in the monitoring backend of the corresponding environment. The vmonitor-collector periodically pulls the monitoring item configuration, collects business data, and reports it once a minute.

2) Data aggregation: in the old version, RabbitMQ routed the collected data to the RabbitMQ in the monitoring data center (this hop is skipped when both sit in the same data center) for the monitoring backend to consume, and the CDN served the periodic configuration pulls of each application. The new vmonitor-gateway serves as the monitoring data gateway: monitoring data is reported and metric configurations are pulled over HTTP, abandoning RabbitMQ reporting and CDN configuration synchronization and thus avoiding the impact on monitoring when either of those two components fails.

3) Visualization, alarms, and configuration support (the monitoring backend, VMonitor): responsible for the diversified data displays of the console (including business metric data, per-data-center summaries, single-server data, and composite computed views of business metrics), data aggregation, and alarms (currently SMS and in-house messages).

4) Data storage: an HBase cluster plus open source OpenTSDB act as the aggregation layer; after raw data is reported, OpenTSDB persists it to the HBase cluster. Redis serves as the distributed store for scheduling task assignments and alarm state.

III. Strategies for collecting, reporting, and storing monitoring data

To reduce the cost of onboarding and to avoid RabbitMQ reporting failures and CDN configuration synchronization failures affecting the monitoring system, the collection layer now reports data directly to the proxy layer over HTTP, and the queues between the collection layer and the data proxy layer recover as much data as possible in a disaster.

The detailed process is described as follows:

1) The vmonitor-collector collects and compresses data every minute according to the monitoring configuration and stores it in a local queue (maximum length 100, i.e., at most 100 minutes of data), then reports the data to the gateway over HTTP (a sketch of this queue follows the list).

2) After receiving a report, the gateway (vmonitor-gateway) authenticates it and discards data that fails authentication. It also checks whether the layer below has tripped the circuit breaker; if so, it tells the collection layer to put the data back into its local queue.

3) The gateway checks the monitoring configuration version number carried with the report. If the configuration is out of date, the latest monitoring configuration is included in the response so that the collection layer can update it.

4) The gateway stores the reported data in the Redis queue for the corresponding application (a single application's cache queue key holds at most 10,000 entries). Once the data is enqueued, the gateway immediately returns the HTTP response indicating the data has been received, and the collection layer may remove it from its local queue.

5) The gateway decompresses and aggregates the data from the Redis queue; if the circuit breaker is open, this step is suspended. The aggregated data is then written to OpenTSDB over HTTP, and if that write fails, the circuit breaker is tripped.
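A minimal sketch of the bounded local queue and the requeue-on-failure behavior described in steps 1), 2), and 4) above; class and method names are hypothetical:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class CollectorQueueSketch {
    // At most 100 one-minute batches are buffered locally, i.e. about 100 minutes of data.
    private final BlockingQueue<byte[]> local = new ArrayBlockingQueue<>(100);

    void onMinuteTick(byte[] compressedBatch) {
        while (!local.offer(compressedBatch)) {
            local.poll(); // queue full: drop the oldest batch to make room
        }
        drain();
    }

    void drain() {
        byte[] batch;
        while ((batch = local.peek()) != null) {
            if (postToGateway(batch)) {
                local.poll(); // gateway acknowledged: safe to discard locally
            } else {
                break;        // gateway fused or unreachable: keep the data queued
            }
        }
    }

    // Stand-in for the HTTP report; the gateway ACKs once the data is in its Redis queue.
    boolean postToGateway(byte[] batch) { return false; }
}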

IV. Core metrics

4.1 System monitoring alarms and business monitoring alarms

After the collected data has been stored in HBase via OpenTSDB, the distributed task distribution module performs distributed computation over it. When the alarm rules configured by the business side are met, the corresponding alarms are generated, grouped, and routed to the correct recipients. Alarms can be sent via SMS or in-house messages, and the recipients can be looked up by name, employee ID, or pinyin. When large numbers of duplicate alarms occur, the duplicate alarm information is eliminated. All alarm records are written to MySQL tables for later query and statistics. The purpose of alarms is not only to help developers discover faults promptly and establish a fault emergency mechanism, but also, by combing through monitoring items and alarms in light of business characteristics, to learn from the industry's best monitoring practices. The alarm flow chart is as follows:

4.2 Supported alarm types and calculation formulas

1) Maximum value: an alarm is triggered when the specified field exceeds this value (alarm threshold unit: number).

2) Minimum value: an alarm is triggered when the specified field falls below this value (alarm threshold unit: number).

3) Volatility: the maximum or minimum value within the last 15 minutes is compared with the average over that window, and the percentage deviation drives the alarm. Volatility alarms also require a baseline: the alarm threshold is only evaluated when the value exceeds the baseline, and values below the baseline never trigger the alarm (alarm threshold unit: percent).

Calculation formulas:

float rate = (float) (max - avg) / (float) avg;

float rate = (float) (avg - min) / (float) avg;

float rate = (float) (max - min) / (float) max;

4) Day-over-day: the value at the current time is compared with the value at the same time yesterday, and the percentage change drives the alarm (alarm threshold unit: percent).

float rate = (current value - previous period value) / previous period value

5) Week-over-week: the value at the current time is compared with the value at the same time on the same weekday last week, and the percentage change drives the alarm (alarm threshold unit: percent).

float rate = (current value - previous period value) / previous period value

6) Hourly day-over-day: the sum of the data values over the past hour is compared with the sum over the same hour yesterday, and the percentage change drives the alarm (alarm threshold unit: percent).

Calculation formula: float rate = (float) (anHourTodaySum - anHourYesterdaySum) / (float) anHourYesterdaySum;
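Putting the formulas above together, a minimal Java sketch of the rate calculations; method and variable names are hypothetical, not vivo's actual code:

public class AlarmRateSketch {
    // Volatility: deviation of the 15-minute max/min from the 15-minute average.
    static float volatility(float max, float min, float avg) {
        float spike = (max - avg) / avg; // rise above the average
        float dip = (avg - min) / avg;   // fall below the average
        return Math.max(spike, dip);
    }

    // Day-over-day and week-over-week: change relative to the previous period.
    static float periodOverPeriod(float current, float lastPeriod) {
        return (current - lastPeriod) / lastPeriod;
    }

    public static void main(String[] args) {
        // e.g. this hour's sum is 1200 vs. 1000 for the same hour yesterday: a 20% rise
        System.out.println(periodOverPeriod(1200f, 1000f)); // 0.2
    }
}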

V. Demonstration

5.1 Querying business metric data

1) In the query condition bar, select the desired metric from the "Metric" field.

2) Double-click a metric name on a chart to show an enlarged view; the total value of the metric over the queried time range is displayed at the bottom.

3) Use the scroll wheel to zoom the chart.

5.2 Querying system monitoring & JVM monitoring metric data

1) The page is refreshed automatically every minute.

2) If an entire machine's row shows red, that machine has not reported data for more than half an hour; if the machine has gone offline unexpectedly, it should be investigated.

3) Click the details button to query the system & JVM monitoring data in detail.

5.3 Configuring business metrics

A single (common) monitoring metric collects data from a single log file written by the specified Appender.

[Required] [Metric type] either common or composite. A composite metric is a secondary aggregation over multiple common metrics, so the common metrics must be created first.

[Required] [Chart order] charts are sorted in ascending order by this value, which controls the display order of metric charts on the data page.

[Required] [Metric code] a short UUID code, generated automatically by default.

[Appender] the name of the log4j log file Appender, which must be referenced by a logger-ref; it need not be specified when intrusive collection is used.

[Optional] [Keyword] the keyword used to filter lines of the log file.

[Optional] [Separator] the symbol used to split a log line into columns, usually a comma or another delimiter.
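To illustrate how the keyword and separator settings combine, here is a toy parse of one log line under a hypothetical configuration; the log format and column index are made up:

public class LogLineParseSketch {
    public static void main(String[] args) {
        String keyword = "ORDER_PAID";  // [Keyword]: only matching lines are collected
        String separator = ",";         // [Separator]: splits a line into columns
        String line = "2024-01-01 10:00:00,ORDER_PAID,128"; // hypothetical log line

        if (line.contains(keyword)) {
            String[] cols = line.split(separator);
            long amount = Long.parseLong(cols[2]); // aggregate a chosen column, e.g. by sum
            System.out.println("collected value: " + amount);
        }
    }
}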

VI. Comparison of mainstream monitoring systems

6.1 Zabbix

Zabbix was born in 1998. Its core components are developed in C and its web frontend in PHP. As a fine representative of the older generation of monitoring systems, Zabbix can monitor network parameters, server health, and software integrity, and is widely used.

Zabbix stores its data in MySQL and therefore lacks the tag feature that OpenTSDB supports, so it cannot do multi-dimensional aggregation statistics or alarm configuration, which makes it inflexible. Zabbix also provides no SDK, so application-layer monitoring support is limited, and it lacks the intrusive instrumentation and collection capability that our own monitoring provides.

Overall, Zabbix's greater maturity and high integration come at the cost of flexibility; as monitoring complexity grows, customization becomes harder. Moreover, its use of the relational MySQL database is problematic for inserting and querying monitoring data at scale.

6.2 Open-Falcon

Open-Falcon is an enterprise-grade, highly available, and scalable open source monitoring solution that provides real-time alerting and data monitoring. It is developed in Go and Python and was open sourced by Xiaomi. Falcon makes it easy to monitor the state of entire servers, such as disk space, port liveness, and network traffic. Based on its proxy-gateway, it is also easy to implement application-layer monitoring (such as interface traffic and latency monitoring) and other custom monitoring needs through independent instrumentation, which makes integration convenient.

The official architecture diagram is as follows:

6.3 Prometheus

Prometheus is an open source monitoring and alerting system with a built-in time-series database (TSDB), developed by SoundCloud; it is regarded as an open source version of Google's BorgMon monitoring system and is written in Go.

Like Xiaomi's Open-Falcon and like OpenTSDB, it introduces tags (labels) into the data model, supporting multi-dimensional aggregation statistics and alarm rule configuration, which greatly improves usability. Monitoring data is stored directly in the local time-series database of the Prometheus server, and a single instance can handle millions of metrics. The architecture is simple: it depends on no external storage, and a single server node can work on its own.

The official architecture diagram is as follows:

6.4 Vivo server monitoring (VMonitor)

As the monitoring backend and management system, VMonitor offers visual alarm viewing, business metric configuration, and JVM, system, and business monitoring. The queues between the collection layer (the vmonitor-collector) and the data proxy layer (the vmonitor-gateway) recover as much data as possible in a disaster.

An SDK is provided for easy integration by business teams, supporting application-layer monitoring statistics such as log output filtering and intrusive code reporting. It is built on the open source OpenTSDB time-series database with modified source code: a nimavg function was added, which together with the built-in zimsum function meets the need to leave null values unfilled, giving strong data aggregation capability. It provides real-time, multi-dimensional, multi-channel alarm services.

VII. Summary

This article has introduced the design and evolution of the vivo server monitoring architecture, a real-time monitoring system built on the Java technology stack. It has also briefly surveyed several mainstream monitoring systems in the industry, in the hope of deepening your understanding of monitoring systems and helping you make a more appropriate choice in technology selection.

Monitoring is a broad topic and a large, complex system in its own right. This article has only briefly covered server-side monitoring, namely JVM monitoring, system monitoring, and business monitoring (including log monitoring and intrusive reporting via tool-class code), without touching on client monitoring, full-link monitoring, and the like. To understand monitoring thoroughly, you must combine in-depth theory with practice.

Author: Vivo Internet Server Team -Deng Haibo