The declaration of Japanese

Wonder is the largest and most complete monitoring system at 360. Built by modifying Open-Falcon, it has been running for more than a year since its launch in April 2016. From the original “Beggar version” to the current “Wonder Woman version”, Wonder has gone through many changes. Today I would like to share the company's slide deck: “The Wonder Monitoring System in Practice”.


Preface

Wonder is a monitoring system developed by ADDOPS and Hulk-Dev on top of Open-Falcon. Since its launch at 360 in April 2016, it has grown to more than 40,000 nodes and more than 10 million collected monitoring items.

Features:

  • Powerful and flexible data collection

  • Efficient alarm policy management

  • User-friendly alarm settings

  • Efficient historical data query

  • High availability

Wonder’s improvements over Open-Falcon:

  • Automatic Agent updates

  • Liveness monitoring, port monitoring, and log monitoring

  • Alarm queue control

  • Alarms are sent again after the maximum alarm count has been exceeded

  • Integration with the hardware repair interface, which silences alarms automatically

  • Alarm muting per data center

  • Persistent storage of the LastEvent state

Current status

1

Overall Architecture

2

Liveness component: Sniffer

Sniffer: an in-house probe that monitors the network and port liveness of machines across multiple data centers.

The status charts collected by two groups of Sniffer-Agent liveness components are as follows:

3

Online scale

Online cluster indicators:

  • Transfer_QPS: 200,000/s, or about 60 million monitoring data points reported every 5 minutes

  • Monitoring items: 12 million+

  • Storage space used: 2.4 TB

  • RRD archive retention: 2 years
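As a quick sanity check, the figures above can be cross-checked with a little arithmetic. This is only a back-of-the-envelope sketch: it assumes "2.4 TB" means decimal terabytes and treats "12 million+" as exactly 12 million, neither of which the slides state.

```python
# Figures quoted in the cluster indicators above.
TRANSFER_QPS = 200_000          # data points per second through Transfer
WINDOW_SECONDS = 5 * 60         # the 5-minute window
MONITORING_ITEMS = 12_000_000   # "12 million+", rounded down (assumption)
STORAGE_BYTES = 2.4e12          # 2.4 TB, assuming decimal terabytes


def points_per_window(qps: int, window_seconds: int) -> int:
    """Data points that arrive in one window at a steady rate."""
    return qps * window_seconds


def bytes_per_item(storage_bytes: float, items: int) -> float:
    """Average on-disk footprint of one monitoring item."""
    return storage_bytes / items


if __name__ == "__main__":
    print(points_per_window(TRANSFER_QPS, WINDOW_SECONDS))
    print(bytes_per_item(STORAGE_BYTES, MONITORING_ITEMS))
```

Under those assumptions the 5-minute figure matches the quoted 60 million, and the storage works out to roughly 200 KB per monitoring item.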

4

Reported data

{
    metric: df.bytes.used,
    endpoint: w01v.add.bjyt.qihoo.net,
    tags: fstype=ext4,mount=/,
    value: 1.5,
    timestamp: `date +%s`,
    counterType: GAUGE,
    step: 60
}

COUNTER: data that only ever increases, such as the number of interface calls or network card traffic. GAUGE: the current instantaneous value, which may go up or down, such as CPU usage or average latency.
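To make the difference concrete: a gauge can be plotted as reported, while a counter is normally turned into a per-second rate from two consecutive samples. The sketch below shows that conversion. The function name and the choice to return `None` when the counter goes backwards (for example after a restart) are assumptions for illustration, not something the slides describe.

```python
from typing import Optional


def counter_rate(prev_value: float, prev_ts: int,
                 curr_value: float, curr_ts: int) -> Optional[float]:
    """Per-second rate between two samples of an increasing counter."""
    elapsed = curr_ts - prev_ts
    delta = curr_value - prev_value
    # A non-positive interval or a shrinking counter cannot give a
    # meaningful rate, so the sample is skipped (assumed behaviour).
    if elapsed <= 0 or delta < 0:
        return None
    return delta / elapsed
```

For example, a NIC byte counter that moves from 1000 to 7000 over a 60-second step corresponds to 100 bytes per second.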

sum(df_bytes_used{fstype="ext4",mount="/"}) by (fstype,mount,hulkid)
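The reported item and the query above describe the same series in two notations. Here is a minimal sketch of how one might build such an item and derive the matching selector. The helper names are hypothetical, and the mapping rule (dots become underscores, tags become labels) is inferred only from comparing the two examples. The `hulkid` label in the query does not appear in the payload, so it is left out.

```python
import time


def build_item(metric, endpoint, value, tags, counter_type="GAUGE", step=60):
    """Build one item with the same fields as the reported-data example."""
    return {
        "metric": metric,
        "endpoint": endpoint,
        "tags": ",".join(f"{key}={val}" for key, val in tags.items()),
        "value": value,
        "timestamp": int(time.time()),  # equivalent of `date +%s`
        "counterType": counter_type,
        "step": step,
    }


def to_selector(item):
    """Assumed mapping: metric dots -> underscores, tags -> quoted labels."""
    name = item["metric"].replace(".", "_")
    labels = []
    for pair in item["tags"].split(","):
        key, _, val = pair.partition("=")
        labels.append(key + '="' + val + '"')
    return name + "{" + ",".join(labels) + "}"


if __name__ == "__main__":
    item = build_item("df.bytes.used", "w01v.add.bjyt.qihoo.net", 1.5,
                      {"fstype": "ext4", "mount": "/"})
    print(to_selector(item))
```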

5

RRD archive policy

c.RRA("AVERAGE", 0.5, 1, RRA1PointCnt)
// one point every 5 minutes, kept for 7 days
c.RRA("AVERAGE", 0.5, 5, RRA5PointCnt)
c.RRA("AVERAGE", 0.5, 20, RRA20PointCnt)
c.RRA("AVERAGE", 0.5, 180, RRA180PointCnt)
// one point every 12 hours, kept for 2 years
c.RRA("AVERAGE", 0.5, 720, RRA720PointCnt)
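Each archive's point count follows from its resolution and retention. The sketch below works out the two archives whose retention is given in the comments (5-minute points for 7 days, 12-hour points for 2 years). It assumes a 60-second base step, matching the `step: 60` in the reported-data example, and counts 2 years as 730 days. The actual values of the `RRA…PointCnt` constants are not shown in the slides.

```python
BASE_STEP_SECONDS = 60  # assumed base step, as in the reported-data example


def rra_rows(steps_per_point: int, retention_days: int,
             base_step: int = BASE_STEP_SECONDS) -> int:
    """Number of rows an archive needs to cover the retention period."""
    seconds_per_point = steps_per_point * base_step
    return retention_days * 24 * 3600 // seconds_per_point


if __name__ == "__main__":
    print(rra_rows(5, 7))      # 5-minute points kept for 7 days
    print(rra_rows(720, 730))  # 12-hour points kept for 2 years
```

Under these assumptions the 5-minute archive needs 2016 rows and the 12-hour archive needs 1460 rows, which shows why long retention at coarse resolution is cheap.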

6

Automatic Agent updates

Wonder currently covers nearly 40,000 hosts (including hosts not managed by HULK), so deploying, maintaining, and upgrading the Agent is a substantial amount of work.

  1. Deployment: install on all hosts in one batch using Qcmd.

  2. Version upgrade: the Wonder Agent supports automatic updates.

8

Integration with the CMDB

The private cloud's service tree keeps 360's original hierarchy: node, primary service, sub-service, and role. Policies, custom monitoring, and log monitoring are inherited down the tree; lower levels can freely disable inherited policies and add their own. User permissions are kept strictly consistent with HULK.
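The inheritance rule can be sketched with a small tree structure. This is an illustration only: the class, field names, and policy names are made up, and it assumes that a policy disabled at one level stays disabled for everything below it, which the slides do not spell out.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class TreeNode:
    name: str
    parent: Optional["TreeNode"] = None
    own_policies: Set[str] = field(default_factory=set)
    disabled: Set[str] = field(default_factory=set)  # inherited policies switched off here

    def effective_policies(self) -> Set[str]:
        """Inherited policies minus those disabled here, plus local ones."""
        inherited = self.parent.effective_policies() if self.parent else set()
        return (inherited - self.disabled) | self.own_policies


if __name__ == "__main__":
    root = TreeNode("node", own_policies={"cpu.high", "disk.full"})
    service = TreeNode("service", parent=root,
                       disabled={"cpu.high"}, own_policies={"nginx.50x"})
    role = TreeNode("role", parent=service)
    print(sorted(role.effective_policies()))
```

Resolving policies on demand by walking up the tree keeps each level's configuration small: a level stores only what it adds or switches off.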

9

Alarm settings

Alarm mode:

  • Alarm group settings: apply to all members of the group

  • Individual alarm mode settings: apply only to the currently logged-in user

Adding a policy:

Policy cloning:

Custom monitoring:

Custom monitoring plugins are more flexible in Wonder: users can define a monitoring item's name, the user it runs as, the command to execute, and the unit.

Log monitoring:

Log monitoring watches application logs, processes them, and reports the results. Log paths support date-function matching.
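Date-function matching matters because many applications rotate logs into files named by day. The slides do not show Wonder's own pattern syntax, so the sketch below uses Python's `strftime` placeholders as a stand-in; the path is hypothetical.

```python
from datetime import datetime


def resolve_log_path(pattern: str, now: datetime) -> str:
    """Expand date placeholders in a log path for the given moment."""
    return now.strftime(pattern)


if __name__ == "__main__":
    # Hypothetical path; %Y%m%d stands in for Wonder's date function.
    pattern = "/data/logs/app/access-%Y%m%d.log"
    print(resolve_log_path(pattern, datetime.now()))
```

With a pattern like this, the collector can follow the current day's file without the user reconfiguring the path after every rotation.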

Liveness monitoring:

Liveness monitoring is built in as a special kind of basic monitoring; users only need to configure a policy to receive alarms.

Port monitoring:

Nginx monitoring:

Nginx monitoring is built into the Agent as one of Wonder's basic monitoring items. It supports request counts, request time, and 499 and 50x error statistics, among others.
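The slides do not say how the Agent gathers these numbers. As one possible approach, the sketch below derives them from access log lines. It assumes a hypothetical log format ending in `"$request" $status $body_bytes_sent $request_time`; a real deployment's `log_format` may differ, and the Agent may use a different source altogether.

```python
import re
from collections import Counter

# Matches: "<request>" <status> <bytes> <request_time> at the end of a line.
LINE_RE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ (?P<rt>\d+(?:\.\d+)?)$')


def summarize(lines):
    """Count requests, 499s, and 50x errors, and average the request time."""
    statuses = Counter()
    total_time = 0.0
    matched = 0
    for line in lines:
        m = LINE_RE.search(line.strip())
        if m is None:
            continue  # skip lines that do not fit the assumed format
        matched += 1
        statuses[m.group("status")] += 1
        total_time += float(m.group("rt"))
    return {
        "requests": matched,
        "avg_request_time": total_time / matched if matched else 0.0,
        "499": statuses["499"],
        "50x": sum(n for code, n in statuses.items() if code.startswith("50")),
    }
```

Status 499 is Nginx's code for a client closing the connection before the response was sent, which is why it is tracked separately from server-side 50x errors.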

10

Alarm statistics

In alarm statistics, users can easily view the status and history of alarms, and can manually ignore alarm events (similar to Zabbix’s ACK feature).

Alarm status:

Alarm History:

11

Application monitoring

Application monitoring works as follows: the monitoring system periodically calls monitoring scripts on the servers and VIPs where a service is deployed, and decides whether to raise an alarm based on what the scripts return. Users only need to write the monitoring script, which exercises the key points to be monitored and returns the results in the specified structure.
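To illustrate the evaluation step, here is a minimal sketch of how a monitoring system might turn a script's result into alarm messages. The result structure (a JSON object with a `checks` list of `name`, `ok`, and `message` entries) is invented for this example; the slides mention a "specified structure" but do not show it.

```python
import json


def evaluate(raw: str):
    """Return one alarm message per failed check in a script's result."""
    try:
        checks = json.loads(raw)["checks"]
    except (ValueError, KeyError, TypeError):
        # An unparsable result is itself treated as a failure (assumption).
        return ["monitoring script returned an unreadable result"]
    return [
        f'{check["name"]}: {check.get("message", "failed")}'
        for check in checks
        if not check.get("ok", False)
    ]


if __name__ == "__main__":
    result = json.dumps({"checks": [
        {"name": "db", "ok": True},
        {"name": "cache", "ok": False, "message": "timeout"},
    ]})
    print(evaluate(result))
```

Treating a broken or missing result as a failure means a crashed monitoring script raises an alarm instead of silently passing.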

In progress

What Wonder is currently working on:

  1. Monitoring of sudden changes in LVS traffic

  2. Data filtering module and cache module

  3. Integration with Prometheus

  4. User-defined data reporting interfaces

  5. Viewing historical charts side by side

Future plans

Beyond providing safe and reliable alarming, Wonder's future directions are:

  1. Provide as complete a set of base data as possible and support more flexible custom data push.

  2. Use data mining to assess the overall state of the business.

  3. For intelligent monitoring, go beyond routine predictive alarms (such as disk usage) and work toward dynamic thresholds, alarm correlation, and alarm prediction.

  4. Onboard more business lines and provide more development services.
