Takeaway:

Delivering projects that run smoothly, give customers an unimpeded experience, and are free of bugs is probably every developer's dream. Can we fix a bug before the customer notices it? And when a bug does occur, can we detect it quickly, locate the problem, and fix it in time? To address fast detection and problem localization, we implemented an accurate alarm system based on Prometheus. The system consists of three parts: a log platform, an indicator system, and an alarm system. The solution can quickly notify designated handlers, and each alarm message carries a full set of indicator information so that the problem can be located quickly.

Wu Hua, Senior Java Development Engineer, NetEase Cloud Business

I. Current situation & problem positioning

You have probably been troubled by alarm storms. A flood of unclassified alarm messages pours in, and for a while the people handling them cannot determine where the problem is. Only after going in circles several times do they find that everything was caused by a single issue. Most running projects already have a monitoring system that sends a uniform alarm message when an exception occurs. If the message contains a traceId, you can use it to look up the specific logs and their context on the log platform or in ELK. Alarm messages are usually sent to the project team or project group chat; the project lead sees them and @-mentions someone to locate and analyze the problem. The time this analysis takes grows with the complexity of the problem, and during an alarm storm the complexity is high, so it takes even longer. By then the best window for recovery may already have passed. As time goes on, more customers notice the problem and the scope of impact gradually expands. Our original goal is to fix problems quickly: we want to shorten the detection time and the analysis time as much as possible and complete the repair before customers are aware of it, so that the customer experience is affected as little as possible.

II. Analysis & Solution

In general, the following indicators are required to locate faults:

Logs generally contain the information above but are not limited to it; they also carry additional information that helps locate and fix faults. A log entry is recorded whether or not the interface responds normally. The log data generated by a service system can reach terabytes per day, and processing such a volume directly is unrealistic. The purpose of the indicator system is to extract only the key information of interest, such as service memory, CPU, and GC status, interface response codes other than 200, user-defined system exception status codes, or other service indicators. The indicator system stores, computes, and displays lightweight data, presenting the aggregated information we need. On top of it we can flexibly configure hotspot data displays, trend charts, alarms, and so on.
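As an illustration of reducing raw logs to a lightweight indicator, here is a minimal Python sketch that counts abnormal responses per service and interface. The record fields (`service`, `interface`, `code`) and the sample values are hypothetical and are not taken from the log platform described in this article.

```python
from collections import Counter

def extract_error_counts(records, ok_codes=(200,)):
    """Count log records whose response code is not in ok_codes.

    The result is keyed by (service, interface), so terabytes of raw logs
    collapse into a handful of small counters.
    """
    counts = Counter()
    for rec in records:
        if rec["code"] not in ok_codes:
            counts[(rec["service"], rec["interface"])] += 1
    return counts

# Hypothetical log records; 10001 stands for a user-defined exception code.
logs = [
    {"service": "order", "interface": "/create", "code": 200},
    {"service": "order", "interface": "/create", "code": 500},
    {"service": "order", "interface": "/create", "code": 500},
    {"service": "user", "interface": "/login", "code": 10001},
]
error_counts = extract_error_counts(logs)
print(error_counts)
```

Only the aggregated counters need to be stored and queried, which is what keeps the indicator system lightweight compared with the raw logs.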

Comparison of the indicator systems currently popular in the community:

Existing alarms concentrate on error or exception messages and often lack the method and its context, so they do not give the handler enough information to make a judgment quickly. In addition, sending every alarm without classification or sorting only interferes with the handler's own analysis, especially during an alarm storm. A frantic stream of alarms tends to cause misjudgments, which delays handling. We therefore need to collect and sort error information, and send a message only when a certain threshold is reached. The message goes to the corresponding business handler, which can be a service group or a single person, and contains the time, machine, service, method, trace, module, exception type, and exception information. Alarm information can also be stored in a database so that response time and handling time can be measured.
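The "notify only when a threshold is reached" idea can be sketched as a sliding-window counter. This is a simplified illustration and not the implementation used in this article, where the thresholding is done by Prometheus alarm rules. The class name, the key layout, and the numbers are hypothetical.

```python
from collections import defaultdict, deque

class ThresholdAlerter:
    """Signal one notification per key once `threshold` errors fall inside `window_s` seconds."""

    def __init__(self, threshold, window_s):
        self.threshold = threshold
        self.window_s = window_s
        self._events = defaultdict(deque)  # key -> timestamps of recent errors

    def record(self, key, ts):
        """Record an error at time `ts`; return True if a notification should be sent."""
        q = self._events[key]
        q.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while q and q[0] <= ts - self.window_s:
            q.popleft()
        if len(q) >= self.threshold:
            q.clear()  # reset so the same burst does not notify again
            return True
        return False

alerter = ThresholdAlerter(threshold=3, window_s=60)
key = ("order-service", "TimeoutException")  # hypothetical (service, exception type)
fired = [alerter.record(key, t) for t in (0, 10, 20, 30, 100)]
print(fired)
```

Grouping by a key such as (service, exception type) is what turns a storm of individual errors into a single message for the responsible handler.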

After investigation, we decided to use the open source Prometheus, which consists of two parts, an indicator system and an alarm system, and also provides API interfaces. Indicator data can be stored locally or remotely. Because Prometheus loads data into memory to serve queries, we recommend keeping no more than 15 days of data locally, to avoid running out of disk space or exhausting memory as the data grows. Data can also be written to a remote database, and a number of databases are supported.

Remote storage integrations supported by the community:

  • AppOptics: write
  • Chronix: write
  • Cortex: read and write
  • CrateDB: read and write
  • Elasticsearch: write
  • Gnocchi: write
  • Graphite: write
  • InfluxDB: read and write
  • OpenTSDB: write
  • PostgreSQL/TimescaleDB: read and write
  • SignalFx: write
  • Clickhouse: read and write

The indicator system supports both PULL and PUSH. PULL mode supports flexible job configuration: you can configure the REST interface to pull indicators from and the pull frequency. Prometheus supports hot reloading, so the configuration can be modified remotely and takes effect without a restart. The indicator system is natively integrated with the alarm system, which supports indicator configuration at different granularities, alarm frequency configuration, labels, and so on. Alarm messages can be pushed through Slack, DingTalk, email, a Webhook interface, and more.

To ensure the availability of online services, business services do not directly expose interfaces beyond those required by their business functions. Doing so would easily pollute the business functions, and auxiliary functions must not affect normal business functionality or service performance. System logs are generally collected and stored by other services, such as ELK or a self-developed log platform. We therefore use PULL mode against the log platform, and developed an indicator pull interface in the log platform. The architecture design is shown in the figure.
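A pull interface has to answer each scrape with indicators in the Prometheus text exposition format. The following Python sketch shows what rendering such a response could look like. The function names are hypothetical, the article does not show the log platform's actual implementation, and the `kafka_log` sample simply reuses the metric name from the alarm rule later in this article.

```python
def _escape(value):
    # Label values must escape backslash, double quote, and newline.
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def render_metrics(name, samples, help_text="", metric_type="gauge"):
    """Render samples in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs for one metric name.
    """
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} {metric_type}")
    for labels, value in samples:
        label_str = ",".join(
            f'{k}="{_escape(v)}"' for k, v in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"

text = render_metrics(
    "kafka_log",
    [({"appname": "appname"}, 3)],
    help_text="Kafka log events",
)
print(text, end="")
```

The log platform would return this text from the REST endpoint that the Prometheus job is configured to pull.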

Most alarm information is sent to a responsible group configured per service, by email or group message. In China, day-to-day work does not rely heavily on email, and people pay little attention to email notifications. A group message that does not name a specific person is also easily ignored, which slows the response to alarms. In addition, if an alarm does not carry enough information, it is harder to handle, which slows things down further.

Therefore, we use the Prometheus Alert scheme to send alarm information to the Webhook interface of the log platform, and the log platform selects the final routing destination of the message according to the module configuration.
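The routing step on the log platform side might look like the Python sketch below. It assumes the alert service posts JSON in the Alertmanager webhook shape (a top-level `alerts` list whose items carry `status` and `labels`). The routing table, the destination strings, and the function name are hypothetical.

```python
import json

# Hypothetical routing table: value of the `module` label -> message destination.
ROUTES = {
    "order": "group:order-oncall",
    "kafka": "user:kafka-owner",
}
DEFAULT_ROUTE = "group:default-oncall"

def route_alerts(payload_json, routes=ROUTES, default=DEFAULT_ROUTE):
    """Return (destination, alertname, status) for each alert in a webhook payload."""
    payload = json.loads(payload_json)
    routed = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # Unknown or missing modules fall back to the default destination.
        destination = routes.get(labels.get("module"), default)
        routed.append((destination, labels.get("alertname"), alert.get("status")))
    return routed

# Sample payload in the assumed webhook shape.
sample = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "hukeKfkaDelay", "module": "kafka"},
         "annotations": {}},
        {"status": "firing",
         "labels": {"alertname": "Other", "module": "unknown"},
         "annotations": {}},
    ],
})
routed = route_alerts(sample)
print(routed)
```

Keeping the routing table in the log platform rather than in the alert configuration is what allows a message to reach a single person instead of a whole group.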

The complete execution chain is as follows:

  • The log platform collects logs and provides an interface for pulling indicators
  • Prometheus collects the indicators
  • Prometheus evaluates the configured alarm rules and sends an alarm message when a rule matches
  • Prometheus Alert sends the alarm information to the alarm interface provided by the log platform
  • The log platform calls the Prometheus API to obtain specific indicator information based on the module and indicator name contained in the alarm information
  • The log platform selects the responsible person or responsible group and sends the message according to the existing configuration, the module in the alarm information, and the indicator labels

At this point, the full processing chain of an alarm is complete.
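The step in which the log platform calls the Prometheus API can be sketched with the instant-query endpoint `/api/v1/query` of the Prometheus HTTP API. How the log platform actually combines the alarm labels into a query is not described in this article, so the helper names, the base URL, and the way the selector is composed are assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlparse

def build_selector(metric, label_filters, range_expr=""):
    """Compose a PromQL selector such as kafka_log{appname="x"}[1m]."""
    filters = ",".join(f'{k}="{v}"' for k, v in sorted(label_filters.items()))
    return f"{metric}{{{filters}}}{range_expr}"

def build_query_url(base_url, promql, time=None):
    """Return the instant-query URL of the Prometheus HTTP API for `promql`."""
    params = {"query": promql}
    if time is not None:
        params["time"] = time  # optional evaluation timestamp
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"

selector = build_selector("kafka_log", {"appname": "appname"}, "[1m]")
# http://localhost:9090 is an assumed address for the Prometheus server.
url = build_query_url("http://localhost:9090", f"count_over_time({selector})")
# Round-trip check: decode the URL and recover the PromQL expression.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(url)
print(decoded)
```

The JSON returned by this endpoint can then be attached to the outgoing message, so that the handler sees the indicator values alongside the alarm.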

III. Practice

The implementation steps are as follows:

  • Build a log platform to collect interface and system logs.
  • Expose an indicator pull interface on the log platform.
  • Configure prometheus.yml and start Prometheus.

The configuration of the collection task is as follows:

    - job_name: 'name'
      # How often to scrape the target
      scrape_interval: 1800s
      static_configs:
        - targets: ['localhost:9527']

Alarm configuration is as follows:

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['localhost:9093']

    # Load rules once and periodically evaluate them
    rule_files:
      - "rules.yml"
      - "kafka_rules.yml"

Alarm service port 9093 corresponds to the Prometheus Alert service.

The rules file is configured as follows:

    groups:
      - name: kafkaAlert
        rules:
          - alert: hukeKfkaDelay
            expr: count_over_time(kafka_log{appname='appname'}[1m]) > 0
            labels:
              metric_name: kafka
              module: modulename
              # Quoted so that YAML reads the value as a string, not a list
              metric_expr: '[1m]'
            annotations:
              # The annotation key is missing in the source text; "description" is assumed here
              description: "{{ $value }} times"

  • Since logs are stored in a ClickHouse database, start the Prometheus2click process to persist indicator data in ClickHouse for the long term. The remote read and write endpoints point to Prometheus2click.

    remote_write:
      - url: "http://localhost:9201/write"
    remote_read:
      - url: "http://localhost:9201/read"

  • Configure PrometheusAlert and start the alert process.

    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1m
      receiver: 'web.hook'
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'External alarm message docking interface'

The alarm service receives alarm information from Prometheus, including the configured label information. It then decides whether to forward a message to the third-party interface based on its frequency and silence configuration.
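The frequency and silence decision can be pictured with the simplified Python sketch below. It is not the alert service's actual implementation, which also handles grouping (`group_wait`, `group_interval`). The function name and its arguments are hypothetical.

```python
def should_notify(alert_key, now, last_sent, repeat_interval_s, silenced_keys):
    """Decide whether to forward an alert; record the send time when forwarding."""
    if alert_key in silenced_keys:
        return False  # silenced alerts are never forwarded
    previous = last_sent.get(alert_key)
    if previous is not None and now - previous < repeat_interval_s:
        return False  # already sent within the repeat interval
    last_sent[alert_key] = now
    return True

last_sent = {}
decisions = [
    should_notify("hukeKfkaDelay", 0, last_sent, 60, set()),
    should_notify("hukeKfkaDelay", 30, last_sent, 60, set()),
    should_notify("hukeKfkaDelay", 60, last_sent, 60, set()),
    should_notify("hukeKfkaDelay", 120, last_sent, 60, {"hukeKfkaDelay"}),
]
print(decisions)
```

With `repeat_interval: 1m` as in the configuration above, a still-firing alert is re-sent at most about once a minute, which bounds the message volume even during a storm.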

  • Alarm display & comparison
  • Business alarm example
  • Kafka alarm example

IV. Conclusion

Accurate monitoring and alarming based on Prometheus can effectively avoid alarm storms, speed up the response to and handling of online problems, and make troubleshooting easier for R&D engineers. Flexible message push to different responsible persons helps the right person notice a problem sooner and respond to it in time. Prometheus's built-in indicator collection jobs avoid a lot of repetitive collection work, and it integrates well with the alarm system. Its current weakness is that configuration is somewhat complex and not very flexible.

Readers interested in Prometheus's other features can find more on its official website.

About the author

Wu Hua, Senior Java Development Engineer at NetEase Cloud Business, is responsible for developing and maintaining the core modules of the NetEase Cloud Business Huke system and the Qiyu (Seven Fish) work order system.
