Article | Chen Anqi (alias: Aoqing)

Senior Development Engineer at Ant Group

Responsible for the implementation and productization of cloud-native Prometheus monitoring capabilities at Ant Group

This article is 6,566 words, about a 15-minute read.

Foreword

Logs and metrics are two indispensable data sources for monitoring, providing complete observability for application systems.

AntMonitor, Ant Group's unified monitoring platform, is a monitoring product that mainly collects monitoring data in the form of logs. In the open source community, cloud-native monitoring is implemented primarily in the form of metrics, with Prometheus as the representative.

Prometheus monitoring has a wide user base in the industry thanks to its powerful functionality and the post-hoc analysis capabilities offered by PromQL. It has effectively become an open-source standard and is widely used in open-source Kubernetes cluster monitoring.

Prometheus itself is a standalone monitoring product, which brings certain limitations and difficulties when building a highly available, clustered deployment inside Ant Group, including but not limited to:

  1. Prometheus does not provide the stable, long-term data storage needed for historical data queries in Ant's scenarios;

  2. Prometheus monitoring is lossy: it does not guarantee data completeness and cannot meet the accuracy requirements of data such as transaction counts in Ant's scenarios;

  3. Prometheus does not support log monitoring.

However, after nearly two years of effort, we successfully integrated the main functions of Prometheus into AntMonitor's existing architecture, providing a clustered solution tailored to Ant's scenarios.

This year, we successfully migrated Sigma cluster monitoring (Sigma is Ant's cloud-native cluster management and resource scheduling platform) to AntMonitor and, together with alerting and dashboards, achieved full monitoring coverage of Sigma. Through this migration, AntMonitor incubated a complete set of cloud-native Prometheus monitoring capabilities.

This article briefly introduces AntMonitor's support for Prometheus monitoring and the implementation of Sigma monitoring.

PART. 1 Collection architecture for massive data

Here’s an example of Prometheus Metrics data:

  $ curl localhost:8080/metrics
  # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
  # TYPE go_gc_duration_seconds summary
  go_gc_duration_seconds{quantile="0"} 7.2124e-05
  go_gc_duration_seconds{quantile="0.25"} 0.000105153
  go_gc_duration_seconds{quantile="0.5"} 0.000129333
  go_gc_duration_seconds{quantile="0.75"} 0.000159649
  go_gc_duration_seconds{quantile="1"} 0.070438398
  go_gc_duration_seconds_sum 11.413964775
  go_gc_duration_seconds_count 20020
  # HELP go_goroutines Number of goroutines that currently exist.
  # TYPE go_goroutines gauge
  go_goroutines 47
  # HELP go_info Information about the Go environment.
  # TYPE go_info gauge
  go_info{version="go1.15.2"} 1
  # HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
  go_memstats_alloc_bytes ...

There are significant differences between Metrics data and log data, including but not limited to:

First, log data is written to disk and every log entry carries a standard timestamp, whereas Prometheus metrics live in the application's memory and are timestamped at pull time, so data accuracy depends heavily on the precision of collection scheduling.

Second, log collection is incremental, so the amount of data collected each time is bounded. Metrics, by contrast, are collected in full on every pull, and the text payload often reaches hundreds of MB, so the volume of metrics data is huge and easily exceeds capacity limits.

Third, log data needs to be split, cleaned, and aggregated according to a fixed schema (data table structure), whereas native metrics usually store the raw per-machine data as-is.

In AntMonitor's existing data link, agents collect log data and cache it in memory; a Spark computing cluster then pulls the data from agent memory, aggregates it, and stores it in CeresDB.

Metrics data, however, differs from log data in that the per-machine detail data is itself observable and carries complete information, so metrics can generally skip aggregation and be stored directly. When per-machine detail data must be retained, it is therefore inappropriate to route metrics through agent memory and Spark: doing so wastes computing resources, and agent memory cannot hold the huge volume of metrics data.

Therefore, we provide two data links according to users’ business requirements:

  • If per-machine detail data needs to be retained, the gateway-based detail data collection link is used;

  • If data aggregation is required, the Spark-based aggregated data collection link is used.

1.1 Gateway-based detailed data collection

To store detail data directly without going through the computing cluster, we developed and launched a dedicated link of centralized metrics collection plus push-based storage.

Compared with traditional agent-based collection, centralized collection does not need to be deployed on every physical machine; it collects metrics data simply through HTTP requests. The centralized collector distributes and schedules metrics collection configurations in a structured way, which satisfies the strict requirements metrics collection places on the precision of timestamp scheduling. In addition, centrally collected data is not cached in memory but pushed directly to a PushGateway, which writes it straight into the underlying CeresDB.
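For illustration only, the shape of such a centralized collection configuration can be sketched as follows; every field name, target, and address here is hypothetical, and AntMonitor's internal configuration format may look quite different:

  # Hypothetical sketch of a centralized collection job (field names illustrative only).
  collection_jobs:
    - job_name: sigma-kubelet            # component to scrape, resolved from metadata rather than the Apiserver
      metrics_path: /metrics
      scrape_interval: 15s               # precisely scheduled so that pull timestamps stay accurate
      targets_from: rmc-dimension-table
  push:
    gateway: http://pushgateway.internal:9091   # hypothetical PushGateway address
    storage: ceresdb                     # the gateway writes directly into CeresDB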

This collection scheme satisfies the metrics requirements for timestamp accuracy and per-machine data storage. It also decouples metrics collection from the existing log collection so that the two do not interfere with each other, relieving the heavy pressure on agent memory and computing resources.

This scheme has been successfully applied to Prometheus collection for Ant's Sigma, other technology stacks, and infrastructure, and currently processes hundreds of millions of metric data points per minute.

1.2 Spark-based aggregated data collection

Infrastructure monitoring such as Sigma has a strong need for per-machine detail data. Retaining detail data, however, also has major drawbacks: the data volume is large, storage consumption is high, and queries are slow because a large amount of data has to be read.

Some business application users, however, do not care about per-machine detail data and only care about data aggregated along certain dimensions, such as the machine-room dimension or the cluster dimension. For them, storing detail data wastes a lot of storage and degrades the query experience. In this scenario we therefore kept AntMonitor's traditional log link, aggregating the collected per-machine metrics before storing them. Where businesses do not need per-machine detail, this link saves storage space and speeds up user queries.

Unlike log monitoring, where aggregation rules must be configured by the user, metrics data carries its own schema information, so we automated this configuration: aggregation configurations are generated automatically for metric types such as Gauge, Counter, and Histogram, freeing users from tedious manual aggregation setup:
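As a rough illustration, the auto-generated aggregation for a counter or gauge amounts to something like the native Prometheus recording rules below; the metric names, dimensions, and windows are hypothetical, and the configuration AntMonitor actually generates uses its own internal format:

  # Illustrative only: the kind of aggregation that can be derived from metric type plus labels.
  groups:
    - name: auto-generated-aggregation
      rules:
        - record: cluster:http_requests_total:rate1m      # hypothetical Counter metric
          expr: sum by (cluster) (rate(http_requests_total[1m]))
        - record: cluster:node_memory_usage_bytes:avg     # hypothetical Gauge metric
          expr: avg by (cluster) (node_memory_usage_bytes)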

The following figure summarizes the differences, advantages, and disadvantages of aggregated versus non-aggregated data:

PART. 2 Metadata unification under the dimension table system

Native Prometheus provides various service discovery mechanisms, such as discovering K8s targets through the Apiserver. AntMonitor, however, as Ant's unified monitoring system, obviously cannot discover its monitoring targets automatically through the Apiserver.

On the existing foundation of log monitoring, AntMonitor has built a fairly complete metadata dimension table system covering SOFA, Spanner, OB, and other Ant technology-stack metadata. Metadata tells us where to collect monitoring data from, playing the same role as the native service discovery mechanisms. To reproduce the native functionality, we made the necessary changes to some dimension tables. Below, we take the implementation of Sigma monitoring as an example to briefly introduce our metadata synchronization process.

2.1 Sigma Metadata Synchronization

The prerequisite for AntMonitor to perform Sigma monitoring is to obtain metadata, which tells us where to collect monitoring data.

We therefore designed a "full synchronization + incremental synchronization" metadata scheme based on RMC (Ant's unified metadata platform). Full synchronization guarantees that the metadata is complete and reliable, while incremental synchronization, built on AntQ, keeps the metadata up to date in real time.

As the figure shows, to align with Prometheus's native staleness handling, the Sigma pod metadata carries a unified offline state.
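Purely to illustrate the idea, a synced pod record might look roughly like the sketch below, where every field name and value is hypothetical; the unified offline state lets collection stop so the corresponding series naturally go stale, in line with native Prometheus behavior:

  # Hypothetical shape of a synced Sigma pod metadata record (fields illustrative only).
  pod: trade-center-0
  node_ip: 10.0.0.12
  cluster: sigma-cluster-a
  labels:
    app: trade-center
  state: offline                         # unified offline state; collection stops and the series goes stale
  offline_time: "2021-12-01T10:00:00+08:00"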

Prometheus uses relabeling to filter scrape targets and metric relabeling to edit the pulled data. Sigma metadata synchronization likewise records the necessary Sigma pod labels, and the collection configuration supports blacklists/whitelists based on these labels as well as attaching additional labels, reproducing functionality similar to Prometheus relabeling and metric relabeling and matching the cloud-native monitoring configuration experience (see the sketch below).
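For readers more familiar with native Prometheus, the equivalent intent expressed in standard relabeling syntax would look roughly like the fragment below; the label names, applications, and cluster name are hypothetical, and AntMonitor expresses the same rules through its own structured collection configuration rather than this exact syntax:

  # Native-Prometheus-style sketch; AntMonitor implements the same intent in its own config format.
  relabel_configs:
    - source_labels: [__meta_sigma_pod_app]   # hypothetical label from Sigma pod metadata
      regex: trade-center|payment-core        # whitelist: keep only these applications
      action: keep
    - target_label: cluster                   # attach an additional label to every series
      replacement: sigma-cluster-a            # hypothetical cluster name
      action: replace
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: go_gc_.*                         # blacklist: drop noisy metrics after the pull
      action: drop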

PART. 3 Storage cluster in distributed architecture

Taking Sigma as an example, its metrics require per-machine detail data to be stored, and many core components are collected at a 15s interval, so the write volume is huge, currently millions of data points per second. As a result, queries over such data often time out and fail to return results. For this kind of data volume, CeresDB partitions data by label. Sigma metrics, for example, usually carry a cluster label, so CeresDB partitions them along the cluster dimension. When a PromQL query is executed, it is split along the cluster dimension and pushed down to the data nodes; the proxy node then assembles the final result from the results produced by each data node and returns it to the user.

Unlike the standalone version of Prometheus, CeresDB has a shared-nothing distributed architecture with three main roles in the cluster:

  • Datanode: stores the actual metric data; it is assigned shards and is stateful.

  • Proxy: routes writes and queries; stateless.

  • Meta: stores information such as shards and tenants.

The approximate execution flow of a PromQL query is as follows:

  1. The proxy first parses the PromQL query into a syntax tree and, based on the shard information in Meta, identifies the datanodes involved;

  2. The nodes of the syntax tree that can be pushed down are sent to the datanodes via RPC;

  3. The proxy receives the results from all datanodes, executes the compute nodes of the syntax tree that cannot be pushed down, and returns the final result to the client.

Take the following PromQL query as an example; it is executed along the flow above:

  sum(rate(write_duration_sum[5m])) / sum(rate(write_duration_count[5m]))
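As a rough sketch of how such a query can be decomposed under the cluster-based partitioning described above (an illustration only, not CeresDB's actual plan representation), each datanode evaluates partial aggregates over its own shards and the proxy merges them and performs the final division:

  # Illustrative decomposition only; names such as partial_sum are hypothetical.
  pushed_down_to_each_datanode:
    - sum(rate(write_duration_sum[5m]))       # partial sum over the shards held by that node
    - sum(rate(write_duration_count[5m]))
  executed_on_proxy:
    - sum(partial_sum) / sum(partial_count)   # merge the partial results, then divide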

This sharding scheme, together with other computational optimizations in CeresDB, greatly improves the efficiency of Sigma's data queries. Compared with the native Prometheus + Thanos architecture, most time-consuming queries are now two to five times faster.

PART. 4 Grafana Component

AntMonitor's visualization was designed primarily for log monitoring, while the community commonly uses Grafana to solve visualization for cloud-native monitoring. After comparison, we found that AntMonitor's existing visualization could not support PromQL-based visualization. We therefore needed a solution that could both embed into the existing architecture and meet the requirements of cloud-native visualization.

Grafana's front-end code is written mainly in AngularJS 1.x + React. AngularJS is responsible for bootstrapping and some of the common service classes, while React handles the rendering layer and new feature development.

Readers who were early adopters of SPA applications will know that AngularJS 1.x, as an early full-featured front-end framework, has heavily coupled code that makes it hard to embed into other frameworks. We solved this by modifying the AngularJS source code so that it can bootstrap repeatedly and thus be embedded into other frameworks easily. We also wrote a new React component library around Grafana to make integration with other businesses easier. So far, this capability has been successfully exported to multiple scenarios.

With the front end in place, we did not need the full set of Grafana components or the complete Grafana back-end API. Instead, we implemented the necessary Grafana APIs, such as dashboard, query, and annotation, with SOFABoot, and exposed them to users through our componentized Grafana component.

PART. 5 Rule execution engine in distributed architecture

Native Prometheus provides Recording Rules (RR) and Alerting Rules (AR), which pre-compute and store metrics for everyday queries. This is a capability specific to Prometheus metrics, and it solves the problem of slow real-time queries mentioned above.

On this basis, we developed a custom Rule execution engine for AntMonitor's Prometheus monitoring. It is compatible with the native YAML experience of editing, importing, and distributing RR/AR, and it connects to various alert gateways through AlertManager for alert distribution. This RR/AR Rule execution engine has been successfully applied to Sigma monitoring, the AntMonitor SLO service, system monitoring, and other services.

5.1 AntMonitor SLO service integration practice

The AntMonitor SLO service needs to calculate how various service indicators perform over different time windows, for example the success rate and latency over one day, one week, or one month. These calculations can be delivered to the Rule execution engine as RR, computed, and stored for users to query; an SLO combined with a threshold can also be evaluated as an AR and the resulting alerts sent to the alert gateway.

SLO recording rules follow certain patterns, most of them taking the form of sum_over_time(). Targeting this characteristic, the Rule execution engine implements a sliding-window calculation method that overcomes the various calculation errors introduced by window sliding and effectively reduces the computation pressure on downstream systems.
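For intuition only, SLO-style RR/AR written in native Prometheus rule syntax would look roughly like the snippet below; the metric names, window, and threshold are hypothetical, and the engine's sliding-window optimization is not reflected here:

  # Illustrative only; assumes payment_success_count / payment_total_count are per-interval counts.
  groups:
    - name: slo-example
      rules:
        - record: slo:payment:success_rate_1d
          expr: sum_over_time(payment_success_count[1d]) / sum_over_time(payment_total_count[1d])
        - alert: PaymentSLOBreached
          expr: slo:payment:success_rate_1d < 0.999   # SLO + threshold evaluated as an AR
          for: 5m
          labels:
            severity: critical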

5.2 Playback Recording Rules

The Rule execution engine supports not only real-time RR calculation but also recalculation of historical RR results: it can detect historical calculation results and recompute and overwrite them. The applicable scenarios are as follows:

1) Repairing abnormal data: while the Rule execution engine is running, RR calculations can go wrong for various reasons, such as anomalies in the source data or occasional failures of underlying dependencies like Pontus and CeresDB. Abnormal data can lead to a confusing experience (for example, a success rate above 100%), error propagation (in SLO scenarios, an anomaly at a single point in time keeps being carried forward through the sum_over_time function), and continuous false alerts. A mechanism is therefore needed to quickly repair abnormal data.

2) Backfilling newly onboarded rules: when a new set of RR is onboarded to the Rule execution engine, users normally only get RR results from the time of onboarding and have to wait a while before a trend chart becomes available. If the trend chart they want to view spans a long period, the waiting cost is too high. Historical RR recalculation solves this by directly computing the RR results for any interval before the onboarding time.

PART. 6 Function Outlook

This year we successfully landed Sigma monitoring in AntMonitor, built up infrastructure capabilities, and accumulated practical business experience. The metrics capabilities available to C-end (business) users in AntMonitor are still fragmented, so next year we will shift our focus to serving these users, integrating the Prometheus capabilities to provide a smooth, one-stop monitoring experience.

[Special thanks]

Special thanks to the business side of the project, the Sigma SRE team for their valuable suggestions and opinions, as well as their tolerance and understanding of the difficulties encountered during the development process.

Special thanks to the RMC team, the project's metadata partner, for their friendly support of this project.

Special thanks to every member of the AntMonitor Cloud Native Monitoring project team for their help in this project.

【References】

– Prometheus official documentation: prometheus.io/

– "Evolution of Prometheus on CeresDB": mp.weixin.qq.com/s/zrxDgBjut…
