Service indicator monitoring helps improve service quality, speeds up troubleshooting and development, and reduces the cost of errors.

While designing and implementing service monitoring over the past two days, I found it to be both valuable and a matter of skill. Choosing appropriate monitoring indicators, designing intuitive monitoring charts, and designing a report data structure that supports troubleshooting all matter.

Using two example services, this article analyzes how to choose monitoring indicators comprehensively and gives practical suggestions.

The technical stack is the ELK ecosystem, with Grafana for visualization. Below, "Panel" refers to Grafana's chart unit.

Why monitor services?

Stability is essential for an online service that must serve its callers continuously. Fluctuations in call volume and changes in call time reflect the current state of the service. With monitoring and alerting, developers can see these changes and scale or improve the service in time.

Visual monitoring directly shows abnormal behavior of the service at a given time. Pinpointing when the anomaly occurred helps developers troubleshoot, find the relevant logs, spend less time fixing the bug, and maintain the service more efficiently.

Service indicator monitoring can also reflect user behavior in real scenarios. This has some data analysis value and makes it possible to optimize service quality for specific scenarios.

Services are rarely independent, and the state of their upstreams and downstreams reflects service quality to some extent. Monitoring upstream and downstream communication (call time, call results) is therefore worth considering. That communication may pass through middleware such as message queues and caches, which can become performance bottlenecks, so indicators such as queue backlog, queue message delay, and cache hit rate should be monitored as well.

How to analyze and implement monitoring indicators?

Ditch the useless and grab the observable.

The general principle is that every visual monitor should be able to reveal a possible problem.

For example, monitoring the QPS of service interfaces is the most common practice, and it can reveal an interface being hammered by abusive or scripted requests. Monitoring changes in call time may reveal I/O anomalies.

To list all possible indicators, we can start from a few perspectives. We can go from whole to part (for example, when the service provides one unified interface to multiple callers), or we can analyze the service at the interface level, the upstream and downstream level, the quality level, and the business level.

From whole to part

Example Service Introduction

The list output service is a unified service that outputs the data of multiple lists. Given the name of a list, the caller gets that list's data. The service has multiple callers. Because it exposes only one interface, that interface is the only one to monitor. The code emits one report log entry per call, containing the IP address of the machine where the service runs, the requested list name, and the interface call time.

Design of monitoring

Following the whole-to-part approach, first we can monitor:

  1. QPS of the get-list interface
  2. Time taken by the get-list interface

These two indicators reflect the overall performance of the service. But because the service outputs multiple lists, two more monitors are worthwhile:

  1. The number of requests for each list
  2. The time taken by requests for each list

The second monitor, for time, uses the 50th, 85th, and 95th percentiles to reflect both the overall level and the local extremes.
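As a minimal sketch of what these percentiles capture, the following computes them with the nearest-rank method over a handful of made-up call times. The function name `percentile` and the sample values are assumptions for illustration; in practice the aggregation would normally be done by the query backend rather than in application code.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical call times in milliseconds; one request is a clear outlier.
elapsed_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 17]

summary = {f"p{p}": percentile(elapsed_ms, p) for p in (50, 85, 95)}
print(summary)
```

With this data the 50th and 85th percentiles stay close to the typical call time, while the 95th percentile exposes the single slow request.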

In the first monitor, multiple curves in one Panel show the number of requests for each list. This shows the difference between the lists, but it does not visualize each list's share of total requests. So you can add a monitor:

  1. The percentage of requests each list received over a period of time

This monitor is best shown as a pie chart, which is itself an example of going from the whole to the parts.
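The data behind such a pie chart is just each list's share of the total. A rough sketch, assuming the list names have already been extracted from the report logs for the chosen time window (the names `request_share`, `hot`, `new`, and `weekly` are made up):

```python
from collections import Counter

def request_share(list_names):
    """Return each list's percentage of total requests, rounded to one decimal."""
    counts = Counter(list_names)
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Hypothetical list names taken from ten report log entries.
requests = ["hot"] * 6 + ["new"] * 3 + ["weekly"] * 1
shares = request_share(requests)
print(shares)
```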

In addition to the monitors above, each host that runs the service can be monitored. Tracking QPS and call time per host shows the state of each machine.
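Because each report entry carries the host IP, the same data can be regrouped per host. The sketch below assumes report records shaped like the log entry described earlier; the field names `host_ip`, `list_name`, and `elapsed_ms` are illustrative, not the service's actual schema.

```python
from collections import defaultdict

def per_host_stats(records):
    """Group report records by host and return request count and average time."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["host_ip"]].append(record["elapsed_ms"])
    return {
        host: {"requests": len(times), "avg_ms": sum(times) / len(times)}
        for host, times in grouped.items()
    }

# Hypothetical report records.
records = [
    {"host_ip": "10.0.0.1", "list_name": "hot", "elapsed_ms": 10},
    {"host_ip": "10.0.0.1", "list_name": "new", "elapsed_ms": 20},
    {"host_ip": "10.0.0.2", "list_name": "hot", "elapsed_ms": 90},
]
stats = per_host_stats(records)
print(stats)
```

A host whose average time stands apart from the others, like the second one here, is a candidate for closer inspection.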

From multiple dimensions of the monitored service

Example Service Introduction

UniteCashOut is a unified cash withdrawal service that lets each platform convert users' points into cash and pay it out to the users. The points are tiered: different amounts of points convert into different amounts of cash.

The service exposes several interfaces, such as querying information about the tiers, creating a withdrawal order, confirming and executing a withdrawal order, and querying a withdrawal order.

The service has several upstreams and one downstream. The downstream pays a specified amount into the user's WeChat or QQ wallet account.

The service needs to receive a message from the downstream to determine whether a withdrawal succeeded, and it also needs to send a message to the upstream to report that the withdrawal succeeded.

Train of thought

From results to implementation

Unlike the previous example, this service has multiple interfaces, a complete business process, upstreams and downstreams, and multiple access parties, so the code will contain more than one report point.

We design and implement monitoring with a results-to-implementation approach. This is like designing a product on the principle of what you see is what you get: start with the result (the visual monitoring Panel) and work backward to the implementation.

Steps

The steps to complete a visual monitoring development can be:

  1. Investigate the service comprehensively: its upstreams and downstreams, interfaces, technology, and performance.
  2. Draft the information each Panel should display, considering it from various dimensions.
  3. Review the Panel proposal with other team members and revise it.
  4. Determine which Panels are feasible and consider what data to report.
  5. Consider the point at which the data should be reported (under what circumstances the data should be reported).
  6. Design the report data structure and implement the reporting code.
  7. Configure the Panel in Grafana.

The third step is important: the review helps you design monitoring that is actually valuable, and it helps you understand the project from multiple dimensions.

In the fourth and fifth steps, what contextual information should be reported? Two points are worth noting:

Don’t be afraid that reported data may never show up in a visualization. Rich contextual information can help you solve potential problems in the future.

Contextual information matters even more for logs, because it avoids repeated log queries when a problem occurs.

In the sixth step, you need to consider the requirements of all Panels together and design a concise report data point, so that the code does not emit redundant reports.

Designing Panels from multiple dimensions

This section focuses on steps 3 and 4.

Monitoring can cover several dimensions, including interface, upstream and downstream, quality, and business.

Interface dimension

Monitor every interface that needs it, especially interfaces whose load is likely to fluctuate. If an interface is rarely used, it may not be worth monitoring.

Monitor each interface in terms of QPS and time, where time can be represented by several percentiles.
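QPS itself is simply the number of requests that fall into each one-second bucket. A small sketch, assuming each report carries a timestamp in epoch seconds (the function name and sample timestamps are made up; a real setup would usually let the query backend do this bucketing):

```python
from collections import Counter

def qps_by_second(timestamps):
    """Count requests in each one-second bucket, ordered by time."""
    buckets = Counter(int(ts) for ts in timestamps)
    return dict(sorted(buckets.items()))

# Hypothetical request timestamps in epoch seconds.
timestamps = [100.1, 100.5, 100.9, 101.2, 103.0, 103.4]
qps = qps_by_second(timestamps)
print(qps)
```

Note that seconds with no requests (102 here) do not appear at all, so a chart drawn from this data needs to treat missing buckets as zero.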

Upstream and downstream dimension

Since the service has multiple upstream callers, you can create a separate Panel for each upstream to see how well the service serves each access party.

Calls to the downstream may directly affect service quality, so their results are a good starting point for improving the service. You can monitor the time taken by downstream calls, the number of calls, and whether each call succeeded. A pie chart can show the distribution of errors among the returned results.

In addition, this example service exchanges messages with its upstreams and downstream, so message queue delay should also be monitored.
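The three downstream indicators named above (number of calls, success, and error distribution) can all be derived from one record per call. The sketch below assumes each record has a numeric `code` field where 0 means success; the error codes 1001 and 2002 are invented for illustration.

```python
from collections import Counter

def summarize_calls(records):
    """Summarize downstream calls: total count, success rate, and error distribution."""
    total = len(records)
    failures = [r for r in records if r["code"] != 0]
    return {
        "total": total,
        "success_rate": (total - len(failures)) / total if total else None,
        "errors": dict(Counter(r["code"] for r in failures)),
    }

# Hypothetical downstream call records.
records = [
    {"code": 0, "elapsed_ms": 120},
    {"code": 0, "elapsed_ms": 95},
    {"code": 1001, "elapsed_ms": 3000},
    {"code": 0, "elapsed_ms": 110},
    {"code": 1001, "elapsed_ms": 3000},
    {"code": 2002, "elapsed_ms": 40},
]
summary = summarize_calls(records)
print(summary)
```

The `errors` mapping is the data a pie chart of error distribution would display.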

Quality dimension

A process (in this case, from placing the order to completing the withdrawal) takes a certain amount of time. If that time exceeds expectations, service quality is poor and the service needs adjusting. The overall process time is therefore a useful monitoring indicator.
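One way to obtain the overall process time is to pair the report for the start of the process with the report for its end, keyed by order ID. This is a sketch under assumptions: the stage names `placed` and `completed`, the event tuple layout, and the 60-second threshold are all hypothetical.

```python
def slow_orders(events, threshold_s):
    """Return {order_id: duration} for completed orders that took longer than threshold_s.

    events is an iterable of (order_id, stage, timestamp_in_seconds) tuples.
    """
    placed, completed = {}, {}
    for order_id, stage, ts in events:
        if stage == "placed":
            placed[order_id] = ts
        elif stage == "completed":
            completed[order_id] = ts
    durations = {oid: completed[oid] - placed[oid] for oid in completed if oid in placed}
    return {oid: d for oid, d in durations.items() if d > threshold_s}

# Hypothetical events: order C has not completed yet and is ignored.
events = [
    ("A", "placed", 0),
    ("B", "placed", 5),
    ("C", "placed", 10),
    ("A", "completed", 30),
    ("B", "completed", 125),
]
slow = slow_orders(events, threshold_s=60)
print(slow)
```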

Business dimension

Business indicators depend, of course, on the specific scenario.

In this case, the withdrawal amount is an indicator. If a Panel shows that the withdrawal amount exceeded expectations at a certain time, there may be a major bug. If today's withdrawal amount differs greatly from yesterday's, the cause needs to be analyzed.

Therefore, one Panel can compare amounts over the same period on different days, and another can show the distribution of withdrawal amounts across access parties.
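The same-period comparison amounts to dividing today's value by yesterday's for each time bucket and flagging large ratios. A minimal sketch, assuming hourly totals keyed by hour of day; the figures and the threshold `max_ratio=2.0` are invented.

```python
def day_over_day(today, yesterday, max_ratio=2.0):
    """Return {hour: ratio} for hours where today's amount exceeds max_ratio times yesterday's."""
    flagged = {}
    for hour, amount in today.items():
        base = yesterday.get(hour, 0)
        if base > 0 and amount / base > max_ratio:
            flagged[hour] = round(amount / base, 2)
    return flagged

# Hypothetical hourly withdrawal totals.
today = {9: 1200, 10: 1500, 11: 9000}
yesterday = {9: 1000, 10: 1400, 11: 1800}
flagged = day_over_day(today, yesterday)
print(flagged)
```

Hours with no data for yesterday are skipped here rather than flagged; whether that is the right choice depends on the business.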

Analyzing report points and data structure

You should not insert code that reports data until you understand the logic of the code.

But before we do that, there are two important considerations:

  1. The collected data must be accurate.
  2. The data must contain enough contextual information to be valuable.

To collect data accurately, for example to capture every timing measurement, you need to wrap the code being measured so that exceptions cannot interfere with recording the time.

Valuable contextual information in this example includes the user's order information, the withdrawal amount, and the user's IP. Kibana indexing then lets you locate a problem using this contextual information. (This is a good alternative to printing logs.)
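In Python, one way to wrap measured code so that an exception cannot skip the timing is a context manager built on `try`/`finally`. This is a sketch, not the service's actual reporting code: the name `timed`, the in-memory `reports` list standing in for the real report channel, and the field names are all assumptions.

```python
import time
from contextlib import contextmanager

reports = []  # in-memory stand-in for the real report channel

@contextmanager
def timed(name, **context):
    """Record the elapsed time of the wrapped block, even if it raises."""
    start = time.perf_counter()
    ok = True
    try:
        yield
    except Exception:
        ok = False
        raise  # the caller still sees the original exception
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        reports.append({"name": name, "ok": ok, "elapsed_ms": elapsed_ms, **context})

# A successful call is reported.
with timed("confirm_order", order_id="o-1"):
    time.sleep(0.01)

# A failing call is reported too, and the exception still propagates.
try:
    with timed("confirm_order", order_id="o-2"):
        raise RuntimeError("downstream unavailable")
except RuntimeError:
    pass

print([(r["order_id"], r["ok"]) for r in reports])
```

Because the report is emitted in `finally`, slow failures show up in the time Panels instead of silently disappearing from them.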
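A single report record that carries this context might look like the sketch below. Every field name and value here is hypothetical; the point is only that one record holds enough context to serve several Panels and to be searched later.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class WithdrawReport:
    """One report data point; all field names are illustrative assumptions."""
    order_id: str
    access_party: str
    interface: str
    amount_cents: int
    user_ip: str
    elapsed_ms: float
    result: str

    def to_json(self) -> str:
        """Serialize the record as one JSON line, ready for log shipping."""
        return json.dumps(asdict(self), sort_keys=True)

report = WithdrawReport(
    order_id="o-20240101-0001",
    access_party="platform_a",
    interface="confirm_order",
    amount_cents=5000,
    user_ip="203.0.113.7",
    elapsed_ms=182.5,
    result="ok",
)
print(report.to_json())
```

The same record can feed an interface time Panel (`interface`, `elapsed_ms`), a per-access-party amount Panel (`access_party`, `amount_cents`), and a search by `order_id` during troubleshooting.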

Configure the Panel in Grafana

All the previous preparations were for this step of configuration.

In this section, we will explain the basic features of Grafana, which will help you quickly understand its capabilities and get started.

The basic unit of Grafana monitoring is the Panel, which can be understood simply as a visual chart.

Several panels form a Dashboard, that is, a whole monitoring layout.

Each Panel can have multiple queries against the same data source, but a Dashboard does not require all of its Panels to use the same data source.
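The relationship between Dashboard, Panel, query, and data source can be pictured with a plain data model. Note that this is not Grafana's dashboard JSON schema or API, only an illustration of the structure described above; the data source names and query strings are made up.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Panel:
    """A chart with one data source and one or more queries against it."""
    title: str
    datasource: str
    queries: List[str]

@dataclass
class Dashboard:
    """A layout of Panels; the Panels may use different data sources."""
    title: str
    panels: List[Panel] = field(default_factory=list)

    def datasources(self) -> List[str]:
        """Return the distinct data sources used across all Panels."""
        return sorted({p.datasource for p in self.panels})

dashboard = Dashboard("UniteCashOut", [
    Panel("Interface QPS", "es-report-logs",
          ["interface:create_order", "interface:confirm_order"]),
    Panel("Downstream errors", "es-downstream-logs", ["NOT code:0"]),
])
print(dashboard.datasources())
```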