1 Deploy the Docker service

curl https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo -o /etc/yum.repos.d/docker.repo
yum list docker-ce --showduplicates | sort -r    # Display all available versions
yum install -y docker-ce-20.10.5                 # Install a specific Docker version
systemctl start docker     # Start the docker service
systemctl status docker    # Check the docker service status
systemctl enable docker    # Start docker automatically at boot
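A quick sanity check that the daemon is running and enabled (optional):

docker --version
systemctl is-active docker     # should print "active"
systemctl is-enabled docker    # should print "enabled"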

2 Deploy the Prometheus service

Create the mon user and the required directories:

groupadd -g 2000 mon
useradd -u 2000 -g mon mon
mkdir -p /home/mon/prometheus/{etc,data,rules}

Create a configuration file:

vim /home/mon/prometheus/etc/prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
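Before starting the container, the file can optionally be validated with promtool, which ships inside the prom/prometheus image (a quick sketch; docker pulls the image automatically if it is not yet present):

docker run --rm \
    -v /home/mon/prometheus/etc/prometheus.yml:/etc/prometheus/prometheus.yml \
    --entrypoint promtool prom/prometheus \
    check config /etc/prometheus/prometheus.yml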

Pull the image and start the container:

docker pull prom/prometheus
cd /home/mon/
chown -R mon. prometheus
docker run -d --user root -p 9090:9090 --name prometheus \
    -v /home/mon/prometheus/etc/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v /home/mon/prometheus/rules:/etc/prometheus/rules \
    -v /home/mon/prometheus/data:/data/prometheus \
    prom/prometheus \
    --config.file="/etc/prometheus/prometheus.yml" \
    --storage.tsdb.path="/data/prometheus" \
    --web.listen-address="0.0.0.0:9090"
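Once the container is up, Prometheus exposes simple health endpoints that can be used to confirm it is serving requests (optional check):

curl -s http://localhost:9090/-/healthy    # returns a short "Healthy" message
curl -s http://localhost:9090/-/ready      # returns a short "Ready" message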

3 Deploy the Grafana service

Create data directory:

mkdir -p /home/mon/grafana/plugins

Install the plugins. Download the Grafana plugin package (grafana-plugins.tar.gz) to /tmp first:

tar zxf /tmp/grafana-plugins.tar.gz -C /home/mon/grafana/plugins/
chown -R mon. /home/mon/grafana
chmod 777 -R /home/mon/grafana

Pull the image and start the container:

docker pull grafana/grafana:latest
docker run -d -p 3000:3000 -v /home/mon/grafana:/var/lib/grafana --name=grafana grafana/grafana:latest
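Grafana's health endpoint is a quick way to confirm the container is serving (optional check):

curl -s http://localhost:3000/api/health    # returns a small JSON document with the database status and version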

4 Configure Grafana to connect to Prometheus

Open http://ip:3000 in a browser. The initial account is admin/admin, and you will be asked to change the password on first login.

Configure the Prometheus Dashboard in the following sequence:
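If you prefer configuration as code over clicking through the UI, the data source can also be provisioned from a file. A minimal sketch, assuming the Prometheus address below is replaced with your own and the provisioning directory is mounted into the container:

mkdir -p /home/mon/grafana/provisioning/datasources
cat > /home/mon/grafana/provisioning/datasources/prometheus.yml << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://<prometheus-ip>:9090    # placeholder: the Prometheus host/port reachable from Grafana
    isDefault: true
EOF
# Add this mount when starting Grafana: -v /home/mon/grafana/provisioning:/etc/grafana/provisioning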

5 Deploy the Node_Exporter service

I’ll use monitoring an Aliyun ECS instance as an example.

Install and configure Node_Exporter:

curl -L https://github.com/prometheus/node_exporter/releases/download/v1.1.1/node_exporter-1.1.1.linux-amd64.tar.gz > /opt/node_exporter-1.1.1.linux-amd64.tar.gz
cd /opt
tar zxf node_exporter-1.1.1.linux-amd64.tar.gz
mv node_exporter-1.1.1.linux-amd64 node_exporter

Configure the service startup script:

vim /usr/lib/systemd/system/node_exporter.service

[Unit]
Description=node_exporter service
 
[Service]
User=root
ExecStart=/opt/node_exporter/node_exporter
 
TimeoutStopSec=10
Restart=on-failure
RestartSec=5
 
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl start node_exporter
systemctl status node_exporter
systemctl enable node_exporter
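A quick local check that the exporter is serving metrics:

curl -s http://127.0.0.1:9100/metrics | head -n 5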

Configure the reverse proxy under Nginx on this ECS server:

vim /usr/local/nginx/conf/conf.d/www.conf

    # prometheus monitoring: node_exporter
    location /node/exporter {
        proxy_pass http://127.0.0.1:9100/metrics;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
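Test the configuration syntax before reloading:

/usr/local/nginx/sbin/nginx -t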
/usr/local/nginx/sbin/nginx -s reload

Modify the configuration file on the Prometheus server:

vim /home/mon/prometheus/etc/prometheus.yml
        
  - job_name: 'node'
    static_configs:
    - targets: ['www.test.com']
      labels:
        instance: node
    scheme: https
    metrics_path: /node/exporter

Restart Prometheus container:

docker restart prometheus
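After the restart, the new target should appear under Status -> Targets in the web UI; it can also be checked from the Prometheus HTTP API (a quick sketch):

curl -s 'http://localhost:9090/api/v1/targets' | grep -o '"health":"[a-z]*"'    # expect "up"
curl -s 'http://localhost:9090/api/v1/query?query=up'                           # up == 1 means the scrape succeeds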

Check that the metrics are reachable by opening https://www.test.com/node/exporter in a browser:

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 7
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.15.8"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
......

Create the Dashboard:

6 Deploy the Alertmanager service

Create directory:

mkdir -p /home/mon/alertmanager
chmod 777 -R /home/mon/alertmanager

Create a configuration file:

vim  /home/mon/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '[email protected]'
  # Note: configure the QQ mailbox authorization code here, not the login password. The authorization code can be found in the mailbox account settings.
  smtp_auth_password: 'abcdefghijklmnop'
  smtp_require_tls: false

route:
  group_by: ['alert_node']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'

receivers:
- name: 'email'
  email_configs:
  - to: '[email protected]'
    send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alert_node', 'dev', 'instance']
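The file can optionally be validated with amtool, which is bundled in the prom/alertmanager image (a sketch):

docker run --rm \
    -v /home/mon/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
    --entrypoint amtool prom/alertmanager:latest \
    check-config /etc/alertmanager/alertmanager.yml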

Pull the image and start the container:

docker pull prom/alertmanager:latest
chown -R mon. alertmanager/
docker run -d --user root -p 9093:9093 --name alertmanager \
    -v /home/mon/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
    prom/alertmanager:latest \
    --config.file="/etc/alertmanager/alertmanager.yml" \
    --web.listen-address="0.0.0.0:9093"
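Alertmanager also exposes health endpoints for a quick check:

curl -s http://localhost:9093/-/healthy
curl -s http://localhost:9093/-/ready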

Check the IP address of the Alertmanager container; it will be used as the alerting target in the Prometheus configuration:

docker exec -it alertmanager /bin/sh -c "ip a"

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
    link/ether 02:42:ac:11:00:04 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.4/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
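The same address can be read directly with docker inspect, which is handier for scripting (a sketch):

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' alertmanager

Note that this address can change if the container is recreated, so re-check it after rebuilding the Alertmanager container.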

Modify the Prometheus configuration file to point it at Alertmanager:

vim /home/mon/prometheus/etc/prometheus.yml

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.17.0.4:9093

rule_files:
  - "/etc/prometheus/rules/*rules.yml"

Configure the alerting rules:

vim /home/mon/prometheus/rules/alert-node-rules.yml
groups:
  - name: alert-node
    rules:
    - alert: NodeDown
      # Note: job_name must match that configured in the Prometheus configuration file
      expr: up{job="node-service"} == 0
      for: 1m
      labels:
        severity: critical
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} is down"
        description: "Instance: {{ $labels.instance}} has been down for 1 minute"
        value: "{{ $value }}"

    - alert: NodeCpuHigh
      expr: (1 - avg by (instance) (irate(node_cpu_seconds_total{job="node-service",mode="idle"}[5m]))) * 100 > 80
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} CPU usage is too high"
        description: "CPU usage exceeds 80%"
        value: "{{ $value }}"

    - alert: NodeCpuIowaitHigh
      expr: avg by (instance) (irate(node_cpu_seconds_total{job="node-service",mode="iowait"}[5m])) * 100 > 50
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels. Instance}} CPU IOwait usage is too high"
        description: "CPU IOWAIT usage exceeds 50%"
        value: "{{ $value }}"

    - alert: NodeLoad5High
      expr: node_load5 > (count by (instance) (node_cpu_seconds_total{job="node-service",mode='system'})) * 1.2
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} load(5m) is too high"
        description: "Load(5m) is too high, exceeding the number of CPU cores by 1.2 times"
        value: "{{ $value }}"

    - alert: NodeMemoryHigh
      expr: (1 - node_memory_MemAvailable_bytes{job="node-service"} / node_memory_MemTotal_bytes{job="node-service"}) * 100 > 60
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} Memory usage is too high"
        description: "Memory usage exceeds 90%"
        value: "{{ $value }}"

    - alert: NodeDiskRootHigh
      expr: (1 - node_filesystem_avail_bytes{job="node-service",fstype=~"ext.*|xfs",mountpoint ="/"} / node_filesystem_size_bytes{job="node-service",fstype=~"ext.*|xfs",mountpoint ="/"}) * 100 > 90
      for: 10m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} Disk (/ partition) usage is too high"
        description: "Disk(/ partition) usage exceeds 90%"
        value: "{{ $value }}"

    - alert: NodeDiskBootHigh
      expr: (1 - node_filesystem_avail_bytes{job="node-service",fstype=~"ext.*|xfs",mountpoint ="/boot"} / node_filesystem_size_bytes{job="node-service",fstype=~"ext.*|xfs",mountpoint ="/boot"}) * 100 > 80
      for: 10m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} Disk (/boot partition) usage is too high"
        description: "Disk(/boot partition) usage exceeds 80%"
        value: "{{ $value }}"

    - alert: NodeDiskReadHigh
      expr: irate(node_disk_read_bytes_total{job="node-service"}[5m]) > 20 * (1024 ^ 2)
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} Disk read byte rate is too high"
        description: "Disk read byte rate exceeds 20 MB/s"
        value: "{{ $value }}"

    - alert: NodeDiskWriteHigh
      expr: irate(node_disk_written_bytes_total{job="node-service"}[5m]) > 20 * (1024 ^ 2)
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} Disk write rate is too high"
        description: "Disk write rate exceeds 20 MB/s"
        value: "{{ $value }}"

    - alert: NodeDiskReadRateCountHigh
      expr: irate(node_disk_reads_completed_total{job="node-service"}[5m]) > 3000
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels. Instance}} Disk IOPS read rate is too high"
        description: "Disk IOPS Read rate exceeds 3000 IOPS per second"
        value: "{{ $value }}"

    - alert: NodeDiskWriteRateCountHigh
      expr: irate(node_disk_writes_completed_total{job="node-service"}[5m]) > 3000
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels. Instance}} Disk IOPS write rate is too high"
        description: "Disk IOPS Write rate exceeds 3000 IOPS per second"
        value: "{{ $value }}"

    - alert: NodeInodeRootUsedPercentHigh
      expr: (1 - node_filesystem_files_free{job="node-service",fstype=~"ext4|xfs",mountpoint="/"} / node_filesystem_files{job="node-service",fstype=~"ext4|xfs",mountpoint="/"}) * 100 > 80
      for: 10m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} Disk (/ partition) inode usage is too high"
        description: "Disk (/ partition) inode usage exceeds 80%"
        value: "{{ $value }}"

    - alert: NodeInodeBootUsedPercentHigh
      expr: (1 - node_filesystem_files_free{job="node-service",fstype=~"ext4|xfs",mountpoint="/boot"} / node_filesystem_files{job="node-service",fstype=~"ext4|xfs",mountpoint="/boot"}) * 100 > 80
      for: 10m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} Disk (/boot partition) inode usage is too high"
        description: "Disk (/boot partition) inode usage exceeds 80%"
        value: "{{ $value }}"

    - alert: NodeFilefdAllocatedPercentHigh
      expr: node_filefd_allocated{job="node-service"} / node_filefd_maximum{job="node-service"} * 100 > 80
      for: 10m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels. Instance}} Filefd percentage is too high"
        description: "Filefd open percentage over 80%"
        value: "{{ $value }}"

    - alert: NodeNetworkNetinBitRateHigh
      expr: avg by (instance) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]) * 8) > 20 * (1024 ^ 2) * 8
      for: 3m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels. Instance}} Network receiving bit rate is too high"
        description: "The rate at which Network receives bits exceeds 20MB/s"
        value: "{{ $value }}"

    - alert: NodeNetworkNetoutBitRateHigh
      expr: avg by (instance) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]) * 8) > 20 * (1024 ^ 2) * 8
      for: 3m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels. Instance}} Network Sending bit rate is too high"
        description: "Network sending bit rate exceeds 20MB/s"
        value: "{{ $value }}"

    - alert: NodeNetworkNetinPacketErrorRateHigh
      expr: avg by (instance) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m])) > 15
      for: 3m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} Rate of receiving error packets is too high"
        description: "Network receives error packets at a rate greater than 15 per second"
        value: "{{ $value }}"

    - alert: NodeNetworkNetoutPacketErrorRateHigh
      expr: avg by (instance) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m])) > 15
      for: 3m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} Rate of sending error packets is too high"
        description: "Network sends error packets at a rate of more than 15 per second"
        value: "{{ $value }}"

    - alert: NodeProcessBlockedHigh
      expr: node_procs_blocked{job="node-service"} > 10
      for: 10m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} Too many tasks are currently blocked"
        description: "The number of tasks currently blocked in Process exceeds 10"
        value: "{{ $value }}"

    - alert: NodeTimeOffsetHigh
      expr: abs(node_timex_offset_seconds{job="node-service"}) > 3 * 60
      for: 2m
      labels:
        severity: info
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance}} time deviation is too large"
        description: "Time deviation of Time node exceeds 3m"
        value: "{{ $value }}"
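Since the rules directory is already mounted into the container, the file can be validated with promtool before restarting (optional):

docker exec prometheus promtool check rules /etc/prometheus/rules/alert-node-rules.yml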

Restart Prometheus container:

docker restart prometheus

To verify that alert emails are delivered, stop node_exporter on the monitored ECS server:

systemctl stop node_exporter

Then refresh the Prometheus web UI and open the Alerts page; the NodeDown rule is now PENDING:

Wait a minute and refresh again; the alert has changed to FIRING:

At this point, let’s check our email:

The alert email has arrived. Now restore the service:

systemctl start node_exporter

Shortly afterwards we receive the email indicating that the alert is resolved and the service has recovered:

7 Run multiple container instances

To run multiple instances, adjust settings such as the host ports, container names, and data paths. For example:

Prometheus

docker run -d --user root -p 9091:9090 --name prometheus-poc \
    -v /home/mon/prometheus-poc/etc/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v /home/mon/prometheus-poc/rules:/etc/prometheus/rules \
    -v /home/mon/prometheus-poc/data:/data/prometheus \
    prom/prometheus \
    --config.file="/etc/prometheus/prometheus.yml" \
    --storage.tsdb.path="/data/prometheus" \
    --web.listen-address="0.0.0.0:9090"

Differences:

  • -p 9091:9090
  • --name prometheus-poc
  • -v /home/mon/prometheus-poc/etc/prometheus.yml:/etc/prometheus/prometheus.yml
  • -v /home/mon/prometheus-poc/rules:/etc/prometheus/rules
  • -v /home/mon/prometheus-poc/data:/data/prometheus

Grafana

docker run -d -p 3001:3000 -v /home/mon/grafana-poc:/var/lib/grafana --name=grafana-poc grafana/grafana:latest

Differences:

  • -p 3001:3000
  • --name=grafana-poc
  • -v /home/mon/grafana-poc:/var/lib/grafana

8 References

Prometheus + Grafana Docker deployment

Dashboard Download