Monitoring is a field with many mature solutions, and it may seem there is little left to discuss. However, as computing technology develops, more and more scenarios need monitoring, and the field keeps growing in deeper and broader directions. Across business domains (operations and maintenance, power, transportation, industrial control, coal, oil and gas, scientific research), every industry with quantitative monitoring indicators needs data collection, storage, analysis, and visualization. The focus of monitoring, however, can vary greatly from business to business.

Overview of the O&M monitoring system

This article limits the business domain to IT operations monitoring and discusses some of the problems that need to be addressed. IT operations monitoring data is usually time-sensitive data, such as

  • System resource indicators: CPU, memory, I/O, and bandwidth
  • Software system metrics: liveness, number of connections, number of requests, number of timeouts, number of errors, response time, type of service, and other business-related metrics

These data come from a variety of sources: operating system commands such as free, vmstat, sar, and iostat, or, if the program is deployed in a container such as Tomcat, WebLogic, IIS, or JBoss, the collection interfaces those containers expose. But the indicators that really matter to a software system, the ones the business actually cares about, come primarily from instrumentation (tracking points embedded in the code). There is much to instrument, on both the front end and the back end, and the events are usually business behaviors. How to store and analyze the data after it is collected is an important question for the monitoring system. From the data collected in real time, the main goals are to obtain the following information

  • Health status of the host
  • Whether the software is running normally, and how stable the system is
  • Software load: whether it can meet online performance requirements, whether instances need to be added, where the performance bottlenecks are, and how to improve performance
  • User behavior analysis and user profiling

Composition of the operation and maintenance monitoring system

A typical monitoring system is usually divided into five modules: data acquisition, data transmission, data storage, statistical analysis, and data visualization. The analysis results are ultimately fed back to product managers and software developers, both to keep the online software running stably and to provide key information for improving it.

1. Data acquisition module

Data collection is the first step of a monitoring system. Whether the collected information is rich, accurate, and timely directly affects how useful the monitoring system is.

For collecting host state and basic software performance data, the easiest tool to use is Telegraf, a plugin-driven server agent that extracts metrics, events, and logs directly from the containers and systems it runs on, and from third-party APIs. Telegraf also has output plugins that send metrics to a variety of data stores, services, and message queues, including InfluxDB, OpenTSDB, TDengine, NSQ, and more.

Collecting the log data generated by a software system, however, is up to the system's developers. Log data can be classified into structured and unstructured logs. Data that is useful for business analysis is usually structured in nature, but it is often recorded only as unstructured text. Many developers write their logging code carelessly, often without being aware of it. The resulting logs vary widely in format and compress poorly, which pushes storage costs and analysis workload downstream to data analysts, who are limited by their tools and are usually unable to deliver accurate, real-time results. As it happens, the log module is the easiest part of a software system to change with the least impact, so abstracting, improving, and structuring it is both important and simple.
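For example, here is a hypothetical sketch of the same login event in both forms; the field names are illustrative, not taken from any particular system:

# Unstructured: free text, format varies by developer, hard to parse, compresses poorly
2019-07-04 10:23:44 user gsl logged in from 10.0.0.8 after 2 retries

# Structured: fixed fields that are easy to parse, store, compress, and analyze
{"ts": "2019-07-04 10:23:44.509", "event": "login", "user": "gsl", "ip": "10.0.0.8", "retries": 2}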

2. Data transmission module

The network environment is usually not the monitoring system's main concern. However, given the volume and real-time requirements of monitoring data, it helps to distinguish slow logs from fast logs.

For fast logs, the most popular transport is a RESTful interface; the main design choice is between Pull and Push. If a device is associated with many services, Push mode is recommended: it preserves real-time behavior and avoids caching data. Pull mode is simpler; the monitored software system generally only needs to provide an HTTP interface, which suits pulling simple values such as system status, access counts, and access times. Fast logs usually need to be stored in a real-time analysis system to generate real-time reports. TDengine provides a RESTful interface that can quickly process pushed HTTP requests and real-time logs.
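As a minimal sketch of Push mode, an agent can post an SQL insert to TDengine's RESTful interface over HTTP. The database udb, the table req_log and its columns, and the default root/taosdata credentials are assumptions for illustration; the table must be created beforehand:

# Push one fast-log record (timestamp, status code, response time) to TDengine
curl -u root:taosdata \
  -d "insert into udb.req_log values (now, 200, 23)" \
  http://127.0.0.1:6020/rest/sql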

Slow logs are usually written to a log file, collected by a separate general-purpose log collector that writes them to Kafka, and then streamed onward; downstream consumers read the data from Kafka and load it into the data storage module.
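A hedged sketch of the collection side of such a pipeline, using the console producer that ships with Kafka; the log path, broker address, and topic name slow-logs are assumptions for illustration:

# Tail the application log and feed each new line into Kafka for downstream consumers
tail -F /var/log/myapp/app.log | \
  kafka-console-producer.sh --broker-list localhost:9092 --topic slow-logs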

3. Data storage module

The choice of data storage is very important in a monitoring system, and there are many big data engines to choose from: engines optimized for time series data, such as the time series databases Prometheus, InfluxDB, TDengine, ClickHouse, OpenTSDB, and Graphite; and general-purpose analytical engines, such as the Hadoop ecosystem and its streaming computing engines. How to choose? Again, starting from the type of data recorded, the indicators to compare are write speed, acquisition frequency, data compression ratio, and query and analysis speed.

If the data is not tied to time, Hadoop is better suited to the problem. If it is time series data, a time series database is the better choice. Among time series databases, InfluxDB, OpenTSDB, Cassandra, and MongoDB can be used if the data is unstructured; Prometheus, ClickHouse, and TDengine can be used if the data is structured. Of the latter three, Prometheus is limited by its design and requires careful planning for horizontal scaling; ClickHouse is oriented more toward analytics than real-time data; TDengine is a newcomer, but it performs excellently in write speed, query speed, and compression ratio.

4. Data statistical analysis module

The goals of statistical analysis should not be constrained by the chosen storage engine. Generally speaking, though, statistical analysis of monitoring data is a series of time-related analyses, which fall into two categories (sketched in SQL after this list)

  • Real-time analysis: latest value, real-time curve, stream computing, sliding window, historical snapshot, etc.
  • Non-real-time analysis: annual, monthly, and daily reports, grouping, aggregation, etc.
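As a hedged sketch of how both categories map onto SQL, assuming the udb.cpu super table that Telegraf creates in the walkthrough below (the column name f_usage_idle follows that example):

-- Real-time: the latest value of a metric
select last(f_usage_idle) from udb.cpu;

-- Real-time: per-minute averages over the last hour, i.e. a real-time curve
select avg(f_usage_idle) from udb.cpu where ts > now - 1h interval(1m);

-- Non-real-time: a daily report, aggregated per day
select avg(f_usage_idle), min(f_usage_idle) from udb.cpu interval(1d);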

The query performance of these metrics is a key factor when selecting a data storage engine. TDengine's query performance is excellent; it can turn much traditional non-real-time analysis into real-time analysis. Taking full advantage of this makes it possible to offer users new features and expand into new business.

5. Data visualization module

For data visualization, there are not many open source options besides Grafana. For use within a department it is sufficient; for an external project, or when data must be provided across departments, you need to write your own interface, one that is easier to use and richer in query conditions, to present the results of real-time or scheduled computations and get better feedback.
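A custom interface can fetch its data through the same RESTful interface TDengine uses for writes, with results returned as JSON. A minimal sketch, reusing the illustrative udb.cpu table and default credentials from above:

# Fetch the latest idle-CPU reading for a custom dashboard page
curl -u root:taosdata \
  -d "select last(f_usage_idle) from udb.cpu" \
  http://127.0.0.1:6020/rest/sql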

Rapidly building an operation and maintenance monitoring system based on TDengine

According to the TDengine white paper, TDengine innovatively defines the storage structure of time series data and is easy to install and use, with a high compression ratio and good query performance, which makes it especially suitable for processing real-time monitoring data. Monitoring logic specific to a particular business is hard to illustrate here. However, since TDengine integrates quickly with the open source data collector Telegraf and the open source visualization system Grafana, this section follows the user manuals of those systems to build an operations monitoring system quickly, without writing any code.

1. Architecture diagram

2. Install and configure the TDengine

  • Download TDengine-1.6.0.0.tar.gz from http://www.taosdata.com/downloads/
  • Install TDengine: decompress the package and run install.sh
  • Start TDengine: run sudo systemctl start taosd
  • Check whether the installation succeeded: run TDengine's command-line shell program taos. Information similar to the following is displayed
Welcome to the TDengine shell, server version:1.6.0  client version:1.6.0
Copyright (c) 2017 by TAOS Data, Inc. All rights reserved.

taos>


3. Install and configure Telegraf

  • Download telegraf_1.7.4-1_amd64.deb from https://portal.influxdata.com/downloads/
  • Install Telegraf: sudo dpkg -i telegraf_1.7.4-1_amd64.deb
  • Configure Telegraf: modify the TDengine-related configuration items in the Telegraf configuration file /etc/telegraf/telegraf.conf

In the Output Plugins section, modify the [[outputs.http]] configuration items

url: http://ip:6020/telegraf/udb, where ip is the IP address of any server in the TDengine cluster, 6020 is the TDengine RESTful interface port, telegraf is a fixed keyword, and udb is the name of the database (created in advance) used to store the collected data
method: "POST"
username: the username for logging in to TDengine
password: the password for logging in to TDengine
data_format: "json"
json_timestamp_units: "1ms"


For example,

[[outputs.http]]
   url = "http://127.0.0.1:6020/telegraf/udb"
   method = "POST"
   username = "root"
   password = "taosdata"
   data_format = "json"
   json_timestamp_units = "1ms"


In the Agent section, modify the following configuration items

hostname: the name of the collection device; make sure it is unique
metric_batch_size: 30, the maximum number of records Telegraf writes per batch. Increasing it reduces how often Telegraf sends requests, but for TDengine the value must not exceed 50


For example,

[agent]
   hostname = "gsl"
   metric_batch_size = 30
   interval = "10s"
   debug = true
   omit_hostname = false


  • Start Telegraf: sudo systemctl start telegraf
  • Test whether data from Telegraf is being received, as in the sketch after this list
    • Enter the show databases statement in the taos shell; you should see a database named udb
    • Run the use udb statement
    • Run the show stables statement to see the super table cpu
    • Run the show tables statement to see ordinary data tables such as cpu_gsl_cpu0
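A hypothetical taos shell session for these checks; the output is omitted here, and the ordinary table names are auto-generated from the hostname and tags, so yours may differ:

taos> show databases;
taos> use udb;
taos> show stables;
taos> show tables;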

4. Install and configure Grafana

  • Download grafana_6.2.5_amd64.deb from https://grafana.com/grafana/download
  • Install Grafana: sudo dpkg -i grafana_6.2.5_amd64.deb
  • Configure Grafana: copy the TDengine Grafana plugin from the /usr/local/taos/connector/grafana directory to the /var/lib/grafana/plugins directory
  • Start Grafana: sudo systemctl start grafana-server
  • Log in to the Grafana server at localhost:3000 (username/password: admin/admin) and configure the TDengine data source. The TDengine data source type appears in the data source list

  • Enter http://localhost:6020 in the Host text box and save

  • You can then see the newly created TDengine data source in Grafana’s list of data sources

  • Use the TDengine data source when creating dashboards

Click the Add Query button to add three new queries. In the INPUT SQL input box, enter a query SQL statement whose result set should be curve data with two columns and multiple rows (timestamp and value), for example

select avg(f_usage_idle) from udb.cpu where ts>=$from and ts<$to interval($interval)

$from, $to, and $interval are built-in variables of the TDengine plugin, representing the query range and interval obtained from the Grafana panel. Click the GENERATE SQL button to see the actual SQL statement sent from Grafana to TDengine, for example:

select avg(f_usage_idle) from udb.cpu where ts>='2019-07-04T01:23:44.509Z' and ts<'2019-07-04T07:23:44.511Z' interval(20000a)


Conclusion

Many technical solutions can be used to build a monitoring system; if you are only building a toy, the choice hardly matters. However, if you need to monitor a large amount of data, in scenarios that demand high write and especially analysis performance, TDengine is worth trying.

This article only briefly covers using TDengine. To truly appreciate its read and write performance, you would need to build a much larger test data set.

About TDengine

TDengine is a high-performance, scalable, reliable, zero-management IoT big data platform with independent intellectual property rights. It integrates database, cache, message queue, stream computing, and other functions. Because it is optimized throughout for the characteristics of IoT big data, TDengine's insert and query performance is more than ten times that of general-purpose big data platforms, and it saves storage space substantially. With a SQL interface, it integrates seamlessly with third-party software, greatly simplifying the architecture of IoT platforms and significantly reducing the complexity and cost of development and operations. TDengine can be widely used in the Internet of Things, connected vehicles, industrial big data, and other fields. Open-sourced on July 12, 2019, it topped GitHub's global trending list for several days.

TDengine has more than 10,000 stars on GitHub: github.com/taosdata/TD… Welcome to star us on GitHub!