This article is my contribution to the second phase of the OPS content team. You are welcome to follow the WeChat public account: OPS Unbounded Engineer

Good evening, everyone. Tonight I will share a solution for building a monitoring platform based on Telegraf + InfluxDB + Grafana. First, let’s get to know InfluxDB. InfluxDB is a high-performance database written specifically for time-series data, developed in Go and open source. It is part of the TICK technology stack. It uses the TSM engine for high-speed ingestion and data compression, and it provides high-performance write/query HTTP APIs with a query language similar to SQL (although the underlying data-structure concepts differ). See also InfluxDB Design Insights and Tradeoffs.

To explain some of the names above: the TICK stack refers to four open-source monitoring products developed by InfluxData: Telegraf, InfluxDB, Chronograf, and Kapacitor. InfluxDB is the most widely used open-source time-series database; a common application scenario is log storage. Kapacitor provides an InfluxDB-based monitoring and alerting solution that supports multiple methods of data aggregation, selection, transformation, and prediction. Chronograf is used to present the data and can be replaced with the more powerful Grafana.

The TSM engine is advanced material that I am not familiar with, and there is not much information about it online; interested readers can study it on their own.

InfluxDB

A time-series database is mainly used to store a system’s monitoring data. It generally has the following characteristics:

  • Efficient queries with time as the dimension
  • Convenient downsampling
  • Efficient handling of expired data

To learn InfluxDB, I recommend first getting a basic understanding from the Linux University series of tutorials on it (but don’t get hung up on them, as some of the descriptions are outdated) before moving on to the official pages.

Download and install InfluxDB

# add yum source
cat <<EOF | sudo tee /etc/yum.repos.d/influxdb.repo
[influxdb]
name = InfluxDB Repository - RHEL \$releasever
baseurl = https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key
EOF

# install and check the version
sudo yum install influxdb
influx -version

Start the service (either command works, depending on your init system):

sudo service influxdb start
sudo systemctl start influxdb

Main concepts

InfluxDB’s concepts differ somewhat from those of a traditional database. I will introduce some basic ones here; if you want to know more, please refer to the official InfluxDB Concepts documentation.

Comparison with the corresponding nouns in a traditional database:

InfluxDB concept  Traditional database concept
database          database
measurement       table in a database
point             a row of data in a table

Concepts unique to InfluxDB

The following concepts have no exact counterpart in a traditional database.

Point

A point is the equivalent of a row in a table in a traditional database, and consists of a timestamp, fields, and tags.

Point attribute  Concept in a traditional database
timestamp        Every point carries a timestamp (the primary index, generated automatically); the TSM storage engine treats it specially to optimize subsequent queries
field            (field key, field set, field value) The recorded values themselves, with no index, e.g. a temperature reading
tag              (tag key, tag set, tag value) Indexed attributes, e.g. a region
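In line protocol terms, a point is written as measurement, tags, fields, and timestamp. A minimal sketch using field names from the cpu data Telegraf collects (the values themselves are made up):

cpu,cpu=cpu0,host=VM_42_233_centos usage_idle=98.2,usage_system=1.1 1526008670000000000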
Series

A series is a collection of data in InfluxDB: if all the data in a measurement were shown on a chart, each series would be one of the lines that could be drawn (one per unique combination of tags):

> show series from cpu
key
---
cpu,cpu=cpu-total,host=VM_42_233_centos
cpu,cpu=cpu0,host=VM_42_233_centos
cpu,cpu=cpu1,host=VM_42_233_centos
cpu,cpu=cpu2,host=VM_42_233_centos
cpu,cpu=cpu3,host=VM_42_233_centos

The code structure is as follows:

type Series struct {
    mu          sync.RWMutex
    Key         string              // series key
    Tags        map[string]string   // tags
    id          uint64              // id
    measurement *Measurement        // measurement
}
Shard

Each shard stores the data for a specific window of time; for example, data between 7:00 and 8:00 falls into shard 0, and data between 8:00 and 9:00 falls into shard 1. Each shard corresponds to an underlying TSM storage engine and has its own independent cache, WAL, and TSM files.
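You can inspect the shards with an InfluxQL statement; the output lists each shard’s database, retention policy, and start/end times:

> SHOW SHARDS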

Data Retention Policy

A retention policy (RP) defines how long InfluxDB keeps data. When you create a database, InfluxDB automatically creates a policy named autogen whose retention is unlimited:

> SHOW RETENTION POLICIES ON telegraf
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true

The preceding statement queries the database’s existing policies. The fields of the query result have the following meanings:

field              meaning
name               Name of the policy
duration           How long data is kept; 0s means unlimited retention
shardGroupDuration The time span covered by a single shard group, the basic storage structure of InfluxDB; 168h0m0s means each shard group stores 168 hours of data, after which new data goes into the next shard group
replicaN           Short for replication; the number of replicas
default            Whether this is the default policy

The default shard group duration is derived from the policy duration (from the InfluxDB source):
func shardGroupDuration(d time.Duration) time.Duration {
    if d >= 180*24*time.Hour || d == 0 { // 6 months or 0
        return 7 * 24 * time.Hour
    } else if d >= 2*24*time.Hour { // 2 days
        return 1 * 24 * time.Hour
    }
    return 1 * time.Hour
}

We can create a new retention policy. The following statement creates a 2-hour retention policy named 2h0m0s on the telegraf database and sets it as the default policy:

> CREATE RETENTION POLICY "2h0m0s" ON "telegraf" DURATION 2h REPLICATION 1 DEFAULT
> SHOW RETENTION POLICIES ON telegraf
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        false
2h0m0s  2h0m0s   1h0m0s             1        true

autogen is no longer the default policy; to query data stored under it, you now need to name the policy explicitly in the query:

> SELECT time,host,usage_system FROM "autogen".cpu LIMIT 2
name: cpu
time                host             usage_system
----                ----             ------------
1526008670000000000 VM_42_233_centos 1.7262947210419817
1526008670000000000 VM_42_233_centos 1.30130130130254

For more information on retention policies, see database_management.

Continuous query

A continuous query (CQ) is a statement that runs automatically and periodically inside the database; InfluxDB stores its query results in a specified measurement.

  • Continuous queries are the best way to downsample data; combined with retention policies, they greatly reduce InfluxDB’s resource usage.
  • With continuous queries, the data is stored in a specified measurement, making it easy to work with data of different precision later.
  • Once created, a continuous query cannot be changed. To change one, you must DROP it and then CREATE a new one.

Here’s the syntax for a continuous query:

CREATE CONTINUOUS QUERY <cq_name> ON <database_name>
RESAMPLE EVERY <interval> FOR <interval>
BEGIN
  SELECT <function[s]> INTO <destination_measurement>
  FROM <measurement> [WHERE <stuff>] 
  GROUP BY time(<interval>)[,<tag_key[s]>]
END

For example, the following statement creates a continuous query named cq_30m in the telegraf database that writes the average of the used field into the mem_used_30m measurement every 30 minutes, using the default retention policy:

CREATE CONTINUOUS QUERY cq_30m ON telegraf BEGIN SELECT mean(used) INTO mem_used_30m FROM mem GROUP BY time(30m) END
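Once the CQ has fired at least once, the downsampled data can be inspected directly; a hedged example (what comes back depends on how long the mem data has been accumulating):

> SELECT * FROM mem_used_30m LIMIT 3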

Here are some common operations:

SQL                                                description
SHOW CONTINUOUS QUERIES                            List all continuous queries
DROP CONTINUOUS QUERY <cq_name> ON <database_name> Delete a continuous query

For more information about continuous queries, see continuous_queries.

Commonly used functions

InfluxDB provides a number of useful functions, which fall into the following categories:

  • Aggregate functions

function            description
COUNT(field_key)    Returns the number of values
DISTINCT(field_key) Returns the unique values
INTEGRAL(field_key) Calculates the area under the curve of field values and returns the sum of those areas
MEAN(field_key)     Returns the average value
MEDIAN(field_key)   Returns the middle value
MODE(field_key)     Returns the most frequent value in the field
SPREAD(field_key)   Returns the difference between the minimum and maximum values
SUM(field_key)      Returns the sum of the values
  • Selector functions

function                description
BOTTOM(field_key,N)     Returns the smallest N values
FIRST(field_key)        Returns the oldest value in a field
LAST(field_key)         Returns the newest value in a field
MAX(field_key)          Returns the maximum value in a field
MIN(field_key)          Returns the minimum value in a field
PERCENTILE(field_key,N) Returns the Nth-percentile field value
SAMPLE(field_key,N)     Returns a random sample of N field values
TOP(field_key,N)        Returns the largest N values
  • Transformation functions

function                  description
CEILING()                 ~
CUMULATIVE_SUM()          ~
DERIVATIVE()              ~
DIFFERENCE()              ~
ELAPSED()                 ~
FLOOR()                   ~
HISTOGRAM()               ~
MOVING_AVERAGE()          ~
NON_NEGATIVE_DERIVATIVE() ~
NON_NEGATIVE_DIFFERENCE() ~
  • Prediction functions

function       description
HOLT_WINTERS() Seasonal prediction algorithm; used to forecast the trend of a data series and alert on it
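As a quick illustration of the aggregate functions, the following statement computes the 10-minute average of the usage_idle field from the cpu measurement (collected by Telegraf, introduced below) over the last hour:

> SELECT MEAN(usage_idle) FROM cpu WHERE time > now() - 1h GROUP BY time(10m)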

Telegraf

Now that the time-series database concepts are established, it’s time to write data into the database. You could collect metrics inside your application and write them through the HTTP API that InfluxDB provides; we won’t take that route for now (an application would normally use a metrics library in its own language, e.g. Java metrics for a Java app). Instead, let’s introduce Telegraf, a data collection agent that pairs naturally with InfluxDB.
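For reference, a minimal sketch of such a write against the InfluxDB 1.x HTTP API; the measurement, tag, and field names here are invented for illustration:

curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'app_requests,host=web01 value=42'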

Telegraf is an agent written in Go that collects system and service statistics as part of the TICK technology stack. It has input plugins that can get metrics directly from the system, from third-party APIs, and even from StatsD and Kafka, and output plugins that send the collected metrics to various datastores, services, and message queues, such as InfluxDB, Graphite, OpenTSDB, Datadog, Librato, Kafka, MQTT, NSQ, and so on.

Download and install Telegraf:

wget https://dl.influxdata.com/telegraf/releases/telegraf-1.6.2-1.x86_64.rpm
sudo yum install telegraf-1.6.2-1.x86_64.rpm
telegraf -version

Once Telegraf is installed, its configuration file is located at:

/etc/telegraf/telegraf.conf

Edit the configuration file to set the InfluxDB instance we configured above as the output:

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
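Before starting the service, you can do a one-shot dry run: with the -test flag Telegraf gathers metrics once and prints them to stdout instead of writing them to the outputs:

telegraf -config /etc/telegraf/telegraf.conf -test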

Start the service, check its status, and enable it at boot:

sudo systemctl start telegraf.service
sudo service telegraf status
sudo systemctl enable telegraf.service

Check what data is collected by Telegraf under the default configuration on InfluxDB:

> show databases
> use telegraf
> show measurements
> SHOW FIELD KEYS

How to Configure

By default, the input plugins in the system category are enabled; that is, telegraf.conf contains the following configuration:

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states.
  report_active = false

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs"]

# Read metrics about disk IO by device
[[inputs.diskio]]

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration

# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration

The specific data collected is as follows (the first level is the measurement, the second level its fields; the timestamp field is omitted):

- cpu[units: percent (out of 100)]
    - usage_guest      float
    - usage_guest_nice float
    - usage_idle       float
    - usage_iowait     float
    - usage_irq        float
    - usage_nice       float
    - usage_softirq    float
    - usage_steal      float
    - usage_system     float
    - usage_user       float
- disk
    - free         integer
    - inodes_free  integer
    - inodes_total integer
    - inodes_used  integer
    - total        integer
    - used         integer
    - used_percent float
- diskio
    - io_time          integer
    - iops_in_progress integer
    - read_bytes       integer
    - read_time        integer
    - reads            integer
    - weighted_io_time integer
    - write_bytes      integer
    - write_time       integer
    - writes           integer
- kernel
    - boot_time        integer
    - context_switches integer
    - entropy_avail    integer
    - interrupts       integer
    - processes_forked integer
- mem
    - active            integer
    - available         integer
    - available_percent float
    - buffered          integer
    - cached            integer
    - free              integer
    - inactive          integer
    - slab              integer
    - total             integer
    - used              integer
    - used_percent      float
    - wired             integer
- processes
    - blocked       integer
    - dead          integer
    - idle          integer
    - paging        integer
    - running       integer
    - sleeping      integer
    - stopped       integer
    - total         integer
    - total_threads integer
    - unknown       integer
    - zombies       integer
- swap
    - free         integer
    - in           integer
    - out          integer
    - total        integer
    - used         integer
    - used_percent float
- system
    - load1         float
    - load15        float
    - load5         float
    - n_cpus        integer
    - n_users       integer
    - uptime        integer
    - uptime_format string

How to find metrics and collect data

Telegraf consists of input plugins and output plugins, whose sources live in the plugins/inputs and plugins/outputs directories respectively. All you need to do is browse the official Telegraf repository for the plugins you need, go to the corresponding directory, read the matching .md file, and configure the plugin as it describes, as sketched below.
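For example, to collect network interface metrics you could enable the net input plugin in telegraf.conf. A minimal sketch (see the plugin’s .md file for the full set of options):

# Read metrics about network interface usage
[[inputs.net]]
  ## Collect metrics from all interfaces unless listed explicitly
  # interfaces = ["eth0"]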

Once the Telegraf service is enabled, you will notice an additional telegraf database in InfluxDB containing multiple measurements, which means our data collection has succeeded. Once you have the data, the remaining question is how to aggregate and present it. The visualization suite Grafana is described below.


Grafana

Grafana is an open source metric analysis and visualization suite, commonly used to visualize performance data for infrastructure and time-series data for application analysis. It can also be used in other fields, including industrial sensors, home automation, weather, and process control. Note, however, that our primary concern with Grafana is how to aggregate the data for presentation.

Grafana supports many different time-series data sources; for each one it provides a tailored query editor that supports the source’s particular features. It supports the following data sources: Graphite, Elasticsearch, CloudWatch, InfluxDB, OpenTSDB, Prometheus, MySQL, Postgres, Microsoft SQL Server (MSSQL). Each data source is documented, and you can combine data from multiple sources in a single dashboard. This article only illustrates the InfluxDB data source.

Download and install Grafana:

# install grafana
wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-5.1.2-1.x86_64.rpm
sudo yum install grafana-5.1.2-1.x86_64.rpm

# start the service and enable it at boot:
systemctl enable grafana-server
systemctl start grafana-server

# configuration
Configuration file:    /etc/grafana/grafana.ini
systemd service name:  grafana-server.service
Default log file:      /var/log/grafana/grafana.log
Default database file: /var/lib/grafana/grafana.db

Basic usage

After the service starts, visit http://localhost:3000. The default port is 3000 and can be changed in the configuration file. The default username and password are both admin. After logging in, configure the data source as prompted:

Then create a dashboard:

We will first import a ready-made template to preview the effect, and dig into Grafana dashboard configuration afterwards. Here we choose the official Telegraf: system dashboard, available at https://grafana.com/dashboards/928. Configure your Telegraf as the page instructs, then choose Import -> Upload .json File in your dashboards to import the downloaded template:

View the results:

You can also install plugins, such as a clock panel plugin.

Plugins live in the /var/lib/grafana/plugins directory; use the grafana-cli tool to install them:

> sudo grafana-cli plugins install grafana-clock-panel
installing grafana-clock-panel @ 0.0.9
from url: https://grafana.com/api/plugins/grafana-clock-panel/versions/0.0.9/download
into: /var/lib/grafana/plugins

Installed grafana-clock-panel successfully

Restart grafana after installing plugins. <service grafana-server restart>

# Restart the service
> sudo systemctl restart grafana-server

Configure a few panels yourself

Let’s create a new dashboard. A dashboard consists of multiple rows; each Row is divided into 12 columns, and we can customize a panel’s Span (width) and height. Now let’s add a Singlestat panel and go to Panel Title -> Edit to edit its settings. By default the Metrics view opens, and we get the following:
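For instance, a Singlestat showing memory usage might use a query like the one below against the default Telegraf data. This is a hedged sketch; $timeFilter and $__interval are macros that Grafana fills in for the InfluxDB data source:

SELECT mean("used_percent") FROM "mem" WHERE $timeFilter GROUP BY time($__interval)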

We modify the units and colors in Options:

Also, you can try adding other panels to achieve the following effect:

Grafana is very feature-rich and cannot be covered in full here; please refer to the official documentation to learn more: http://docs.grafana.org/features/datasources/