Monitoring is an essential component of a distributed system: it provides early warning, aids troubleshooting, and supports evaluation and decision making.
Monitoring System Overview
A host CPU alarm is called monitoring; an error in a business log is called monitoring; an APM condition firing is also called monitoring. Distributed systems are complex, so almost any random collection of statistical indicators can claim the label "monitoring." Drawing clear boundaries and relationships takes some effort; otherwise everything kneads into one ball that is hard to take apart.
I usually divide the field along two axes. By data type, monitoring data falls roughly into three categories: logs, metrics, and tracing. By function, it splits into basic (infrastructure) monitoring, middleware monitoring, and business monitoring.
No matter what kind of monitoring system, several module stages are always involved:
❏ Data collection. How to gather data broadly and efficiently.
❏ Data processing. Cleaning, transmission, and storage.
❏ Metric extraction. Big-data computation, generating and storing intermediate results.
❏ Data display. Good-looking, multi-functional presentation.
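The four stages can be sketched as a single, highly simplified pipeline. Everything here is illustrative (the metric name, the host, the in-process queue standing in for Kafka); it only shows how the stages hand data to each other:

```python
from queue import Queue

def collect():
    # 1. Data collection: an agent samples a host metric.
    return {"host": "web-1", "metric": "cpu.used", "value": 0.72}

buffer = Queue()               # 2. Data processing: buffer + transport
buffer.put(collect())          #    (a message queue such as Kafka in production)

def extract(q):
    # 3. Metric extraction: turn the raw sample into a stored indicator.
    sample = q.get()
    return (sample["metric"], round(sample["value"] * 100))

metric, pct = extract(buffer)  # 4. Data display: the dashboard reads this result
```

The point of the hand-offs is that each stage can be swapped independently, which is the theme of the rest of this article.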
The typical implementation
Different monitoring modules focus on different areas and carry different responsibilities. Personally, I prefer to design them independently, then integrate them to an appropriate degree at the product level.
I'll walk through a few typical solutions to show why you want to design these parts separately rather than lumping them together.
For metric collection, prefer components that support a wide range of sources. Telegraf, for example, covers all the common system metrics and most middleware and DB metrics.
In particular, the Jolokia2 input plugin deserves a mention here: it makes features such as JVM monitoring easy to implement.
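A Telegraf `jolokia2_agent` input might look roughly like this. The endpoint URL, metric name, and MBean paths are illustrative; check the input plugin's README for your Telegraf version:

```toml
# Collect JVM memory stats via a Jolokia HTTP endpoint (illustrative)
[[inputs.jolokia2_agent]]
  # Jolokia agent exposed by the JVM (hypothetical address)
  urls = ["http://localhost:8080/jolokia"]

  [[inputs.jolokia2_agent.metric]]
    name  = "jvm_memory"
    mbean = "java.lang:type=Memory"
    paths = ["HeapMemoryUsage", "NonHeapMemoryUsage"]
```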
Once metrics have been collected, I strongly recommend using Kafka as a buffer. Besides its remarkable ability to absorb backlogs, it also makes data sharing easy (such as pushing Nginx logs to the security team). See this earlier post: [360-degree test: will Kafka lose data? Can its high availability meet the need?]
After the metrics enter the message queue, one copy is filtered through Logstash and landed in a store such as ES; the other copy goes through stream computation to extract indicators such as QPS, average response time, and TP values.
There are several ways to perform the aggregate computation that triggers alarms; look-back queries and rolling counters are common. When the data volume is particularly large, pre-aggregate the metric data first; otherwise even an excellent DB like InfluxDB cannot keep up.
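As a concrete example of the "TP value" mentioned above, here is a minimal sort-based sketch. Real pipelines use streaming sketches (t-digest, HdrHistogram and the like) so they never hold all samples in memory; this only shows what the number means:

```python
def tp(latencies_ms, pct):
    """TP-pct: the latency within which pct% of the requests completed."""
    ordered = sorted(latencies_ms)
    # Index of the sample sitting at the pct-th percentile boundary.
    idx = max(0, int(len(ordered) * pct / 100.0) - 1)
    return ordered[idx]

samples = [11, 12, 12, 13, 14, 15, 16, 18, 90, 200]
tp90 = tp(samples, 90)   # 90% of these requests finished within 90 ms
tp50 = tp(samples, 50)   # the median: 14 ms
```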
For presentation, Grafana is the first choice: it looks extremely good and supports embedding into other systems via iframe. Its drawbacks are also significant: limited chart types (no year-over-year, no month-over-month, no slope), and the alerting function is not very usable.
Think the stack is too long? Zabbix is an off-the-shelf alternative with a rich plugin ecosystem, and it works well for small companies. But its components are so tightly coupled that you cannot break them apart or swap out a module when you hit a problem, which is a fatal architectural flaw: customization costs suddenly explode. That is why larger companies tend not to use it.
You could develop one yourself, but that is not easy: there are thorny front-end problems to deal with, and not everyone can make it look good.
For the logging part, the first thing that comes to mind is ELKB. However, I find the ELKB pipeline unstable and incomplete as-is, so I suggest the following modifications:
For collection, use proven log collection components. Logstash's resource control is not very smart; to avoid competing with business processes for resources, Flume and Beats are better choices.
Likewise, a message-queue buffer is essential; a crowd of agents playing dead on the business hosts is no laughing matter.
On log storage: many logs do not need to be kept at all, such as the DEBUG logs that R&D colleagues happily emit, so a logging specification is needed. Logstash filters according to that spec and writes into ES. Log volume is usually large, so it is best to create indexes per day. Logs beyond the retention window can go to a log "fortress" machine (one with a very, very large disk) or into HDFS.
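The two rules above, per-day indexes and filtering DEBUG lines, can be sketched as follows. The index prefix and record shape are assumptions for illustration, not from any real spec:

```python
import datetime

def es_index_for(ts, prefix="app-logs"):
    # One index per day: retention becomes "drop/archive whole indexes",
    # and each index's mapping stays a manageable size.
    return f"{prefix}-{ts:%Y.%m.%d}"

def should_store(record):
    # Per the (hypothetical) logging spec, DEBUG never reaches the store.
    return record.get("level", "INFO") != "DEBUG"

idx = es_index_for(datetime.datetime(2019, 7, 1))
keep = should_store({"level": "DEBUG", "msg": "entered loop"})
```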
How do you alarm on error conditions in service logs, for example N occurrences of some XXXException? You can either write a script, or take a copy of the data, process it into an indicator, and hand it to the metrics collector.
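The "take a copy, process it, emit a metric" path is essentially a sliding-window counter. A minimal sketch (the threshold and window values are illustrative):

```python
from collections import deque

class ErrorRateAlarm:
    """Fire when more than `threshold` matching errors occur within `window` seconds."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.hits = deque()   # timestamps of matching log lines

    def record(self, ts):
        self.hits.append(ts)
        # Drop hits that have slid out of the window.
        while self.hits and self.hits[0] <= ts - self.window:
            self.hits.popleft()
        return len(self.hits) > self.threshold   # True => trigger alarm

alarm = ErrorRateAlarm(threshold=3, window=60)
```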
Ha! And we've circled back to metrics again.
Compared with ordinary monitoring and logging, call chains (APM) are much more complex. Besides having many data sources, they require components that aggregate and display the invocation chain. Although what they show looks simple, this is the most complex module in a monitoring system.
Google's paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" kicked off the popularity of call chains, but standards like OpenTracing emerged only in recent years.
For data processing and subsequent presentation, the techniques are similar to ordinary monitoring; the complexity lies mainly in collecting the call-chain data. Among current implementations, some, like Pinpoint, use javaagent technology to rewrite bytecode directly; others, like CAT, require explicit instrumentation in code. Each has its pros and cons, but all must solve the following problems:
❏ Heterogeneous collection components. Development languages range from Java to Golang.
❏ A wide variety of touchpoints. From front-end instrumentation through Nginx, middleware, DB, every hop must be covered.
❏ Technical difficulties. Such as async calls and cross-process context propagation.
❏ Sampling. Especially at massive call volumes, you must preserve accuracy while controlling cost.
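For the sampling point in particular, a common trick is deterministic head-based sampling keyed on the trace ID, so that every service makes the same keep/drop decision and a chain is either collected in full or not at all. A sketch (the rate and hashing scheme are illustrative):

```python
import hashlib

def should_sample(trace_id: str, rate: float) -> bool:
    """Same trace_id => same decision on every host: no half-collected chains."""
    # Hash the trace id into 10,000 stable buckets.
    bucket = int(hashlib.md5(trace_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return bucket < int(rate * 10_000)

keep = should_sample("trace-42", rate=0.01)   # keep ~1% of traces
```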
The data structure of tracing has been discussed at length elsewhere, so I won't dwell on it here. The various implementations grew up separately, with incompatible protocols and a lot of duplicated work. To solve the API incompatibility between distributed tracing systems, the OpenTracing (opentracing.io) specification was born. It is a set of interface definitions that mainstream tracing servers such as Zipkin and Jaeger are compatible with: as long as you emit data according to the spec, they can collect and display it.
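To make "provide the data according to the specification" concrete: the core of any OpenTracing-style span boils down to a few fields. This is a simplified sketch, not the actual wire format of Zipkin or Jaeger:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    operation: str
    trace_id: str                    # shared by every span in one request
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None  # stitches spans into a tree
    start: float = field(default_factory=time.time)
    tags: dict = field(default_factory=dict)

    def child(self, operation: str) -> "Span":
        # A child span inherits the trace_id; parent_id records causality.
        return Span(operation, self.trace_id, parent_id=self.span_id)

root = Span("http.request", trace_id=uuid.uuid4().hex)
db = root.child("db.query")
```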
OpenTracing is gaining ground by unifying the concepts of tracing, logs, and metrics. There is a diagram worth a look; it's not too late to dig into it when you actually build something (source: Internet). Github.com/opentracing...
Spring Cloud 2 also integrates Micrometer, which works very well with Prometheus; it saves a lot of effort on business metrics monitoring.
That's enough about collection. Standards are a good thing; otherwise every company ends up writing its own huge pile of collection components. As for whether the big domestic vendors' products will move toward the standard, we shall wait and see.
On the server side, let's use Uber's Jaeger as an example of the components a tracing backend generally needs.
❏ Jaeger Client: the SDK that produces span data inside the application.
❏ Agent: a daemon that listens for span data on a UDP port and forwards it in batches to the collector.
❏ Collector: receives the data from jaeger-agent and writes it to backend storage.
❏ Data Store: the backend storage is designed as a pluggable component, supporting Cassandra and ES.
❏ Query: receives query requests, retrieves traces from the backend store, and displays them through a UI.
And yes, as a typical stateless system, losing a peer node has no impact.
Analysis and early warning
The diagrams above mention stream computing more than once, and this does not have to mean Spark Streaming. Pulling records off Kafka one at a time and handling them yourself also counts as stream computing, and using the newer Kafka Streams is an option too. What matters is aggregation, aggregation, aggregation. Important things three times.
Typically you compute QPS, RT and the like, which is pure counting; or slope, i.e. how fast a metric rises or falls; more complex are TP values (what percentage of requests respond within XX ms), the service topology of the call chain, and log exception statistics, all of which can be computed here.
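"Slope" here is nothing mysterious: the rate of change between two samples of an already-aggregated series. A sketch with made-up sample values:

```python
def slope(points):
    """points: [(timestamp_sec, value), ...] -> change in value per second.
    Positive when the metric is climbing, negative when it is falling."""
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)

qps_series = [(0, 100.0), (10, 150.0)]   # illustrative aggregated samples
rate = slope(qps_series)                 # QPS grew by 5 per second
```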
Fortunately, the APIs for stream computing are fairly simple; it's the debugging that is a lot of work.
The analyzed data is derived data and should be stored separately; of course, if the volume is small, it can share storage with the raw data. The volume of analyzed data is evaluable: one point every 5 seconds is a fixed 17,280 points per day per series. The early-warning module reads the analyzed data (the raw data is too large), and that also involves substantial computation.
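The 17,280 figure follows directly from the aggregation interval, which is exactly why analyzed-data volume is evaluable while raw-data volume is not:

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
INTERVAL = 5                     # one aggregated point every 5 seconds

points_per_series_per_day = SECONDS_PER_DAY // INTERVAL   # fixed per series

def daily_points(num_series):
    # Storage grows linearly with the number of series, not with traffic.
    return num_series * points_per_series_per_day
```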
So what are the analyzed statistics used for? Partly alerting; partly presentation.
In a prototype I designed, a metric operation would look something like this:
There are plenty of visualization JS libraries out there, but using them is usually a lot of work. There is no way around it: simple cases can be handled with Grafana configuration, but anything more complex still has to be built. Here are two simple Grafana dashboards for reference.
Overall, the system collects, processes, and applies three kinds of data: logs, metrics, and traces. Some component experience can be shared among them, but the collection and application layers differ greatly. I tried to summarize it all in one diagram, but a diagram showing only which components are involved is ambiguous: each pipeline differs in data flow and processing, which is why I keep emphasizing divide and conquer. The usage experience, though, should not diverge too much: no matter how complex the backend architecture, a coherent overall view makes the product clearer. Is that where your current work is focused?
With the above in mind, you can see that we habitually decompose the monitoring system into modules, so you can easily swap the components inside: Beats for Flume, Cassandra for ES, and so on.
Below I list some commonly used components; if anything is missing, additions are welcome.
Data collection component
Telegraf is used to collect monitoring data. A member of the InfluxData family, it is an agent written in Go that collects system and service statistics and writes them to various databases. The range of supported inputs is very broad.
Flume is mainly used to collect log data; it's an Apache project. The Flume-OG and Flume-NG versions differ greatly. Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission. It supports customizing all kinds of data senders for collection, and provides simple processing before writing to various (customizable) data receivers.
Logstash is an open-source log collection and management tool, a member of the Elastic family. It serves the same function as Flume but consumes far more resources, so it is best deployed on dedicated machines. It is feature-rich and supports Ruby-defined filter conditions.
StatsD is developed in Node.js and transmits over UDP. It is dedicated to collection: after gathering data, it forwards it to other servers for processing. Similar positioning to Telegraf.
Collectd is a daemon that periodically collects system and application performance metrics and provides mechanisms to store the values in a variety of ways.
Standalone visualization components are rare; solutions usually ship their own web UI. Few are as focused as Grafana.
Grafana focuses on presentation: it looks great and integrates a very rich set of data sources. With simple configuration you get very professional-looking monitoring charts.
It also has a large plugin ecosystem; see: grafana.com/plugins?typ...
A product of the Influx family. InfluxDB is an open-source time-series database written in Go with no external dependencies, storing timestamps, events, and metrics. The supported data types are rich and performance is very high. Single-node use is free; the clustered version is paid.
OpenTSDB is a time-series database; strictly speaking, though, OpenTSDB alone is not a DB. A single OpenTSDB node cannot store any data: it is only a read/write service, or more precisely, a read/write service on top of HBase. It can handle massive amounts of distributed data.
Elasticsearch can store metrics, can store logs, and can even store trace relationships. It supports rich aggregation functions, enabling very complex features. However, with overly large time spans, poorly designed indexes, or too many shards, ES easily gets into trouble.
Open-Falcon is from Xiaomi. It actually includes agent, processing, storage, and other modules, and has its own dashboard, so it counts as a full solution; I like it. But it is not widely used at present, and you know how domestic open source tends to go: the company talks it up to the sky, while the community is left using a half-finished product.
Graphite does not collect metrics itself; instead it receives them, database-like, through its backend, then queries, transforms, and combines them in real time. Graphite has a built-in web interface for browsing metrics and graphs. It has been doing well lately and is often paired with Collectd. Grafana also includes it as a default data source.
Prometheus is developed in Golang and its momentum is good. Inspired by Google's Borgmon monitoring system, it was released in 2015, so it's relatively young. Prometheus is as ambitious as its namesake. It is also well supported by Spring Cloud.
Some of the older all-in-one solutions still render graphs with AWT or GD. It always feels like these things are on the verge of extinction.
Zabbix is so widely known that it needs no introduction. But as nodes and services grow, around the 1K mark you hit bottlenecks (including bottlenecks in development and customization). Overall: small companies use it very well; big companies use it very badly.
This one is also relatively old, with a long-established user base. Its installation and configuration are relatively complex, and its functionality feels neither complete nor focused; I don't like it much.
Ganglia's core consists of gmond, gmetad, and a web front end. It monitors system performance such as CPU, memory, disk usage, I/O load, and network traffic, and its curves show each node's working status, helping you tune and allocate system resources to improve overall performance.
Another old-timer; it pairs perfectly with Nagios. Are you still using it?
APM is a more specialized area, with many implementations. Attached: a list of APMs that support OpenTracing.
In fact, Meituan contributed relatively little here; most of CAT came from Dianping. As the basic monitoring component of Meituan-Dianping, CAT is deeply integrated into the middleware frameworks (MVC framework, RPC framework, database framework, cache framework, etc.), providing performance indicators, health status, and basic alarms for each business line. A decent piece of work, but very invasive: with a large codebase, instrumenting everything is a nightmare. The technical implementation is dated, but the control it offers is strong.
Pinpoint is an open-source APM tool on GitHub, written in Java for large-scale distributed system monitoring. Its agent is non-invasive to install, using Java instrumentation technology, which also means it is destined to be Java-only.
SkyWalking carries a Huawei pedigree and is similar to Pinpoint in using probes to collect data. A 2015 product, it uses ES as storage. It has entered Apache and supports OpenTracing.
Zipkin goes this route too, and since Zipkin supports the OpenTracing protocol, related components can be swapped in via plugins, which is generous of it.
Jaeger is developed in Golang, an Uber product: small and exquisite, and it supports the OpenTracing protocol. The web UI is also very nice, and it offers ES and Cassandra as backend storage.
Finally, a paid solution. Why list it? Because it is beautifully done; there is no arguing with that level of polish. Beyond that, its implementation code and documentation are high quality, and worth reading if you want to build your own.
The complexity of a monitoring system lies in its sheer scope. What matters most is clarifying the relationships and giving users a sensible mental model; the so-called product experience comes first. The whole history above shows that standardization is the best way for a technology to advance, though the road there is turbulent: vested interests defend their own barriers and refuse to accept or open up, then suddenly discover that their own stuff is behind the times.