Summary: A company building an internal monitoring platform has many options, whether it assembles open source components or buys a commercial SaaS product. Either way, a real deployment has to decide how data gets to the monitoring platform, or how the platform obtains the data. That comes down to choosing a data collection mode: Pull or Push?

Author: Yuan | Source: Ali Technology official account

One A variety of monitoring systems

Monitoring has always been a core component of IT systems, responsible for detecting problems and helping to locate them. Whether the role is traditional operations, SRE, DevOps, or development, everyone needs to pay attention to the monitoring system and take part in building and optimizing it. Monitoring systems have existed and evolved since the early days of mainframe operating systems and basic Linux metrics. Today a search turns up hundreds of monitoring systems, and they can be classified in many ways, for example:

  1. Monitoring object: general-purpose (a general monitoring approach that fits most monitored objects); special-purpose (customized for one function, such as Java JMX, CPU over-temperature protection, hard disk power-failure protection, UPS switching, switch monitoring, and dedicated line monitoring);
  2. Data collection mode: Push (collectd, Zabbix, InfluxDB); Pull (Prometheus, SNMP, JMX);
  3. Deployment mode: coupled (deployed together with the monitored system); standalone (single node, single instance); distributed (scales horizontally); SaaS (many commercial vendors offer SaaS that requires no deployment);
  4. Data access method: API (data can only be retrieved through specific APIs); DSL (supports some computation, such as PromQL and GraphQL); SQL (standard SQL or SQL-like);
  5. Commercial model: open source and free (e.g. Prometheus, standalone InfluxDB); open source with commercial editions (e.g. InfluxDB cluster edition, Elasticsearch X-Pack); closed-source commercial (e.g. Datadog, Splunk, AWS CloudWatch);

Two Pull or Push

There are many ways to build a monitoring platform for internal company use, from open source stacks to commercial SaaS products. Whichever is chosen, the implementation must settle how data is delivered to the monitoring platform, or how the platform fetches it. This is the choice of data collection mode: Pull or Push?

As the names suggest, a Pull-based monitoring system actively fetches metrics, which requires the monitored objects to be remotely accessible. A Push-based monitoring system does not fetch data itself; the monitored objects actively push their metrics to it. The two approaches differ in many respects. When building or selecting a monitoring system, understand the strengths and weaknesses of both in advance and choose the one that fits. A choice made blindly can be disastrous for the system's later stability and for its deployment and O&M costs.

Three Pull vs Push overview

The comparison below covers several aspects. To save readers' time, a table first gives an overview of the discussion; the details follow in later sections:

Four principles and architecture comparison

As shown in the figure above, the core of data collection in the Pull model is the Pull module, which is usually deployed together with the monitoring backend, as in Prometheus. The main components are:

  1. Service discovery system: host service discovery (which depends on the company's CMDB), application service discovery (such as Consul), and PaaS service discovery (such as Kubernetes). The Pull module must be able to integrate with these service discovery systems.
  2. Pull core module: besides the service discovery integration, it generally pulls data from remote targets over a common protocol and usually supports configuring the pull interval, the timeout, and metric filtering, renaming, and simple processing.
  3. Application-side SDK: listens on a fixed port so that the application's metrics can be pulled.
  4. Exporters: because many kinds of middleware and other systems do not support the Pull protocol, a corresponding Agent has to be developed that collects their metrics and exposes a standard Pull interface.
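
The application-side endpoint and the Pull core module described above can be sketched with the Python standard library alone. This is a minimal illustration, not Prometheus' actual protocol: the `/metrics` path, the one-line-per-metric text format, and the names `METRICS`, `start_metrics_server`, and `scrape` are assumptions made for the example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metrics that the application wants to expose.
METRICS = {"http_requests_total": 42, "queue_length": 3}


class MetricsHandler(BaseHTTPRequestHandler):
    """Application side: answers GET /metrics with one 'name value' line per metric."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = "".join(f"{name} {value}\n" for name, value in METRICS.items()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_metrics_server(port=0):
    """Start the endpoint in a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# An opener with an empty ProxyHandler ignores proxy environment variables,
# so requests to local targets are always sent directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def scrape(url, timeout=2.0):
    """Pull side: fetch the endpoint once, with a timeout, and parse it into a dict."""
    with _opener.open(url, timeout=timeout) as resp:
        text = resp.read().decode()
    samples = {}
    for line in text.splitlines():
        name, value = line.split()
        samples[name] = float(value)
    return samples
```

In a real Pull module, `scrape` would be called once per pull interval for every target returned by service discovery.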

The Push model is relatively simple:

  1. Push Agent: collects metric data from the monitored objects and pushes it to the server. It can be deployed together with the monitored system or separately.
  2. ConfigCenter (optional): provides centralized, dynamic configuration, such as monitoring targets, collection intervals, metric filtering, metric processing, and remote destinations.
  3. Application-side SDK: sends data to the monitoring backend or to a local Agent (the local Agent usually implements the same interface as the backend).
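
A Push Agent can be sketched as a small loop that collects, buffers, and sends. The class below is a hypothetical illustration: `collect` and `send` are callables supplied by the caller (for example a function that reads host metrics and a function that posts to the backend), and the batching policy is an assumption of this sketch. Error handling and retries are left out.

```python
import time


class PushAgent:
    """Collects samples from a monitored object and pushes them to a backend in batches."""

    def __init__(self, collect, send, batch_size=3):
        self.collect = collect        # returns {metric_name: value}
        self.send = send              # delivers a list of samples to the backend
        self.batch_size = batch_size  # flush once this many samples are buffered
        self.buffer = []

    def tick(self, now=None):
        """One collection cycle: read metrics, buffer them, flush when the batch is full."""
        ts = time.time() if now is None else now
        for name, value in self.collect().items():
            self.buffer.append({"name": name, "value": value, "ts": ts})
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send everything buffered so far and clear the buffer."""
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        self.send(batch)
```

A scheduler would call `tick()` once per collection interval; a configuration center, if present, would change `collect`, the interval, or the destination at runtime.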

Summary: Judged purely by deployment complexity, the Pull model is too complex and costly to maintain when monitoring middleware and other systems, and Push is more convenient. For applications themselves, there is little difference in cost between exposing a metrics port and actively pushing.

Five Pull distributed solutions

In terms of scalability, data collection in Push mode is naturally distributed and can scale horizontally without limit as long as the monitoring backend keeps up. Pull mode is more troublesome and requires:

  1. Decoupling the Pull module from the monitoring backend and deploying it separately as an Agent.
  2. Distributed coordination among the Pull Agents. Sharding is generally the simplest method: each Agent obtains the list of monitored machines from the service discovery system, hashes each machine, and takes the hash modulo the number of shards to decide which Agent pulls it.
  3. Optionally adding a configuration center to manage the Pull Agents.
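
The hash-and-modulo sharding in step 2 can be sketched as follows. The function names and the choice of SHA-256 are assumptions for illustration; any hash that is stable across processes would do (Python's built-in `hash()` is not, because string hashing is randomized per process).

```python
import hashlib


def shard_of(target, num_agents):
    """Map a target address to an agent index with a hash that is stable across processes."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_agents


def targets_for_agent(targets, agent_index, num_agents):
    """Return the subset of discovered targets that this agent is responsible for pulling."""
    return [t for t in targets if shard_of(t, num_agents) == agent_index]
```

Each Agent runs the same function over the same discovered list with its own `agent_index`, so every target is pulled by exactly one Agent without the Agents talking to each other.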

As you can see, this distributed approach still has some problems:

  1. A single-point bottleneck remains: all Agents have to query the service discovery module.
  2. After the Agents are scaled out, the targets assigned to each Agent change, which may cause duplicated or missing data.
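
The second problem can be made concrete with a short sketch, assuming the Agents shard by hash modulo the Agent count (the function names and the SHA-256 hash are illustrative assumptions). It lists the targets whose owner changes when the Agent count changes; until every Agent has switched to the new count, such a target may be pulled twice or not at all.

```python
import hashlib


def shard_of(target, num_agents):
    """Stable hash of the target address, reduced modulo the number of agents."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_agents


def moved_targets(targets, old_agents, new_agents):
    """Targets whose responsible agent differs between the old and new agent counts."""
    return [t for t in targets if shard_of(t, old_agents) != shard_of(t, new_agents)]
```

With modulo sharding, going from 3 to 4 Agents is expected to move roughly three quarters of the targets, because a hash value keeps its owner only when its remainders modulo 3 and modulo 4 happen to coincide.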

Six Comparison of monitoring capabilities

1 Monitoring target liveness

Monitoring target liveness is relatively simple in Pull mode: the Pull side knows directly whether it can fetch metrics from a target. When a fetch fails, it also learns the basic error, such as a network timeout or a connection refused by the peer.
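
A minimal sketch of this liveness check follows. It assumes `fetch` is any callable that performs one pull and raises the standard socket exceptions on failure; the `(up, reason)` return shape is an assumption of the example, loosely modeled on the idea of recording an "up" value per target.

```python
import socket


def probe(fetch):
    """Run one pull attempt and classify the outcome as (up, reason)."""
    try:
        fetch()
    except (socket.timeout, TimeoutError):
        return 0, "timeout"
    except ConnectionRefusedError:
        return 0, "connection refused"
    except OSError as exc:
        # Any other network-level failure (unreachable host, DNS error, ...).
        return 0, f"network error: {exc}"
    return 1, "ok"
```

The more specific exceptions are listed first because they are subclasses of `OSError`.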

Push mode is more troublesome. When a target stops reporting, the application may have crashed, there may be a network problem, or the application may have been moved to another node. The Pull module is linked with service discovery in real time and can tell these cases apart, but Push is not, so the server can determine the cause of the failure only if it also interacts with service discovery.
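
The server-side check for Push mode can be sketched as below, under the assumption that the server keeps the last push timestamp of every source and can query the set of currently registered targets from service discovery. The function name, the status labels, and the `max_silence` threshold are all hypothetical.

```python
def classify_targets(last_seen, registered, now, max_silence):
    """Classify targets using push timestamps plus the service discovery registry.

    last_seen:   {target: timestamp of its most recent push}
    registered:  set of targets currently known to service discovery
    """
    status = {}
    for target, ts in last_seen.items():
        if now - ts <= max_silence:
            status[target] = "reporting"
        elif target in registered:
            # Still supposed to exist: the application crashed or the network is broken.
            status[target] = "silent but registered"
        else:
            # No longer registered: it was moved or removed, so silence is expected.
            status[target] = "deregistered"
    for target in registered - set(last_seen):
        status[target] = "never reported"
    return status
```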

2 Data completeness calculation

Data completeness is an important concept in large monitoring systems. Suppose we monitor the QPS of a trading application that runs 1,000 replicas, so the metric has to be summed over 1,000 data series. Without a notion of data completeness, if an alarm is configured to fire when QPS drops by 2% and network problems delay the reports of more than 20 replicas by a few seconds, a false alarm is triggered. Data completeness therefore has to be taken into account when configuring alarms.
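
The arithmetic behind this false alarm, and one possible way to guard against it, can be sketched as follows. The per-replica QPS of 100, the 99% completeness requirement, and the function names are assumptions chosen for the example.

```python
def naive_alarm(current_sum, baseline_sum, drop_threshold=0.02):
    """Fire when the summed QPS drops by more than the threshold, ignoring completeness."""
    drop = (baseline_sum - current_sum) / baseline_sum
    return drop > drop_threshold


def gated_alarm(current_sum, baseline_sum, reported, expected,
                drop_threshold=0.02, min_completeness=0.99):
    """Same rule, but evaluated only when enough of the expected replicas have reported."""
    completeness = reported / expected
    if completeness < min_completeness:
        return False  # data is incomplete: hold the alarm instead of trusting the sum
    return naive_alarm(current_sum, baseline_sum, drop_threshold)
```

With 1,000 replicas at 100 QPS each, the baseline sum is 100,000. If 21 replicas report late, the sum seen by the server is 97,900, a 2.1% drop, although real traffic has not changed. The naive rule fires; the gated rule sees a completeness of 97.9% and holds the alarm.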

Calculating data completeness also depends on the service discovery module. Pull fetches data round by round, so the data is complete once a round finishes; even if some pulls fail, the percentage of missing data is known.

In Push mode, every Agent and application pushes on its own, each with a different push interval and network delay, so the server has to estimate data completeness from historical behavior, which is costly.
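
One simple way to estimate completeness from history is sketched below. It is an assumption of this example, not a method described by the author: the server remembers which sources reported in the last few time windows and treats their union as the expected set.

```python
from collections import deque


class CompletenessEstimator:
    """Estimates per-window completeness from the sources seen in recent windows."""

    def __init__(self, history_windows=5):
        # Each entry is the set of sources that reported in one past window.
        self.history = deque(maxlen=history_windows)

    def close_window(self, sources):
        """Record the sources of the window that just ended and return its completeness."""
        current = set(sources)
        expected = set().union(*self.history) if self.history else set(current)
        self.history.append(current)
        if not expected:
            return 1.0
        return len(current & expected) / len(expected)
```

The cost mentioned above shows up here as per-source state that the server must keep and update for every window.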

3 Monitoring short-lived/Serverless applications

Real-world environments also contain many short-lived or Serverless applications. To keep costs down in particular, we make heavy use of jobs, elastic instances, and serverless applications. For example, when a rendering task arrives, an elastic compute instance is started and then destroyed and released as soon as the task finishes. Other examples are machine learning training jobs, event-driven serverless workflows, and periodic jobs (such as resource cleanup, capacity checks, and security scans). These applications often live for a very short time (possibly seconds or milliseconds), which makes them extremely hard to monitor with Pull's periodic model. In general they have to push their monitoring data actively.

To handle such short-lived applications, pure Pull systems provide an intermediate layer (such as Prometheus' Pushgateway) that accepts active pushes from applications and exposes a Pull port to the monitoring system. This extra middle tier adds management and O&M cost, and because Push is being emulated on top of Pull, reporting latency increases and the metrics of applications that have already exited must be cleaned up promptly.
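
The idea of such an intermediate layer can be sketched in memory as below. This is not how Prometheus' Pushgateway is implemented; in particular, the time-to-live cleanup is an assumption of this sketch to illustrate the cleanup problem, and `MiniPushGateway` and its methods are hypothetical names.

```python
import time


class MiniPushGateway:
    """Accepts pushes from short-lived jobs and serves the latest values to a puller."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._groups = {}  # job name -> (push timestamp, {metric: value})

    def push(self, job, metrics, now=None):
        """Called by the short-lived job before it exits."""
        ts = time.time() if now is None else now
        self._groups[job] = (ts, dict(metrics))

    def scrape(self, now=None):
        """Called by the Pull module; drops entries older than the TTL, returns the rest."""
        ts = time.time() if now is None else now
        self._groups = {
            job: (pushed_at, metrics)
            for job, (pushed_at, metrics) in self._groups.items()
            if ts - pushed_at <= self.ttl
        }
        return {job: metrics for job, (pushed_at, metrics) in self._groups.items()}
```

The added latency is visible in the design: a value pushed just after one pull is not seen by the monitoring system until the next pull interval.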

4 Flexibility and coupling

In terms of flexibility, Pull has a slight advantage: the Pull module can be configured to fetch exactly the metrics that are wanted and to apply simple calculations or secondary processing to them. The advantage is only relative, though. A Push SDK or Agent can be configured the same way, and with a configuration center this configuration is also easy to manage.
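
The kind of filtering and renaming meant here can be sketched as a small function applied to each batch of collected samples. The rule format (`keep_prefixes`, `rename`) is a hypothetical one invented for this example; the same function could run in a Pull module or in a Push Agent.

```python
def apply_rules(samples, keep_prefixes=(), rename=None):
    """Keep only metrics whose name starts with an allowed prefix, then rename some of them.

    samples:        {metric_name: value}
    keep_prefixes:  allowed name prefixes; empty means keep everything
    rename:         {old_name: new_name}
    """
    rename = rename or {}
    result = {}
    for name, value in samples.items():
        if keep_prefixes and not name.startswith(tuple(keep_prefixes)):
            continue
        result[rename.get(name, name)] = value
    return result
```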

In terms of coupling, the Pull model is much less coupled to the backend. The application only has to provide an interface the backend can understand; it does not need to care which backend connects or which metrics the backend wants. The division of labor is clear: application developers only expose their own metrics, and SREs (the monitoring system administrators) collect them. The Push model is more tightly coupled, because the application has to be configured with the backend address and authentication information. With a local Push Agent, however, the application only pushes to a local address, so the cost is relatively small.

Seven Operation and maintenance and cost comparison

1 Resource cost

In overall cost the two approaches differ little, but they differ in who bears it:

  1. In Pull mode, most of the consumption falls on the monitoring system, and the cost on the application side is low.
  2. In Push mode, most of the consumption falls on the pushing applications and the Push Agents, and the monitoring system consumes far less than in Pull mode.

2 Operation and maintenance costs

From an O&M perspective, Pull mode costs more. The components to operate and maintain in Pull mode include the various Exporters, service discovery, the Pull Agents, and the monitoring backend. In Push mode they are only the Push Agents, the monitoring backend, and the configuration center (optional, and usually deployed together with the monitoring backend).

  • Note that in Pull mode the server initiates requests to the clients, so cross-cluster connectivity and the network ACLs that protect the application side must be taken into account. Network connectivity is simpler in Push mode: the server only needs to expose a domain name or VIP that every node can reach.

Eight How to choose

Among open source solutions, Pull mode is represented by the Prometheus family (a family because a default single-node Prometheus has limited scalability, so the community has produced many distributed solutions, such as Thanos, VictoriaMetrics, and Cortex), and Push mode is represented by InfluxDB's TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor). Both have their own strengths and weaknesses. In the cloud native era, with the surge in popularity of Prometheus driven by CNCF and Kubernetes, much open source software has started to expose Prometheus-style Pull ports. At the same time, many systems were not designed to expose a Pull port, and for those it is more reasonable to use a Push Agent.

For applications themselves, however, there is no clear-cut answer on whether to use Pull or Push. The choice still has to be based on the company's actual situation. For example, if the company's cluster network is very complex, Push is easier; if there are many short-lived applications, Push is required; mobile apps can only use Push; and if the system already uses Consul for service discovery, exposing a Pull port makes Pull easy to implement.

Taking everything into account, the best solution for a company's internal monitoring system is to support both Pull and Push:

  1. Use Push Agents to monitor hosts, processes, and middleware;
  2. Use Pull mode for Kubernetes and other systems that directly expose Pull ports;
  3. For applications, choose Pull or Push according to the actual scenario.
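
These selection rules can be written down as a small decision helper. It is only a sketch of the reasoning in this section; the `Target` fields and the order in which the rules are checked are assumptions made for the example, and a real decision would weigh more factors.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    short_lived: bool = False                # jobs, elastic instances, serverless
    reachable_from_monitoring: bool = True   # False for mobile apps or complex networks
    exposes_pull_endpoint: bool = False      # e.g. Kubernetes components


def choose_mode(target):
    """Pick a collection mode for one target following the rules of thumb above."""
    if target.short_lived or not target.reachable_from_monitoring:
        return "push"
    if target.exposes_pull_endpoint:
        return "pull"
    # Hosts, processes, and middleware without a Pull port: use a Push Agent.
    return "push"
```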

Nine SLS strategy for Pull and Push

SLS currently supports unified storage and analysis of logs, time-series monitoring data, and distributed traces. Its time-series monitoring is compatible with the Prometheus format standard and provides standard PromQL syntax. With hundreds of thousands of SLS users, application scenarios vary widely, and no single Pull or Push approach can meet every customer's needs. SLS therefore does not commit to one model but is compatible with both. For the open source community and Agents, SLS's strategy is to be fully compatible with the open source ecosystem rather than to build a closed ecosystem of its own:

  1. Pull model: fully compatible with Prometheus' scrape capability. Prometheus itself can serve as the Pull Agent by sending data through Remote Write; VMAgent, which has scrape capability equivalent to Prometheus, can be used the same way; and SLS's own Agent, Logtail, also implements Prometheus' scrape capability.
  2. Push model: Telegraf currently has the most complete monitoring Push Agent ecosystem in the industry. SLS's Logtail has Telegraf built in and supports all of Telegraf's hundreds of monitoring plug-ins.

Compared with Pull Agents such as VMAgent and Prometheus, and with native Telegraf, SLS adds the capabilities that are most urgently needed: an Agent configuration center and Agent monitoring. The collection configuration of every Agent can be managed on the server side, and the running status of the Agents can be monitored there, which minimizes O&M and management costs.

As a result, building a monitoring solution with SLS is very simple:

  1. Create a MetricStore in the SLS console (web page) to store monitoring data;
  2. Deploy the Logtail Agent (one command);
  3. Configure the collection of monitoring data (Pull or Push) in the console.

Ten Summary

This article discusses one of the most difficult choices in a monitoring system: Pull or Push. The author compares the two from several angles, drawing on years of practical experience and a variety of customer scenarios. It is offered only as a reference for building your own monitoring system. Comments and discussion are welcome.

This article is the original content of Aliyun and shall not be reproduced without permission.