How to Implement Observable Load Test Metrics with Prometheus in the Cloud Native Era

Author: Fuyi

What is observability in performance testing

Observability covers metrics, traces, and logs. It helps us troubleshoot and locate problems quickly in complex distributed systems, and it is an essential operations tool for them.

In load testing, observability matters even more. Besides helping to locate performance problems, the performance indicators in the metrics directly determine whether a load test passes, which in turn decides whether the system can go live. The three parts are used as follows:

• Metrics

System performance indicators, including request success rate, system throughput, and response time.

Resource performance indicators, which measure the usage of the system's hardware and software resources. Read together with the system performance indicators, they show how heavily resources are loaded at a given level of performance.

• Logs

Load engine logs, used to check whether the load engine is healthy and whether the test script reported errors during execution.

Sampling logs, which record the details of API requests and responses. They help you check whether the request parameters of failed requests are correct, and the response details give you the complete error information.

• Traces

Distributed tracing, which diagnoses performance problems by following a request's call chain through the system. It identifies the system in which a failing API reported the error, along with the error stack, so that performance problems can be located quickly.

This article describes how to use Prometheus to make load test metrics observable.

Core indicators for load test monitoring

System performance indicators

The three most important indicators in load test monitoring are request success rate, service throughput (TPS), and request response time (RT). An inflection point in any of the three can be taken as a sign that the system has reached a performance bottleneck.
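As a rough illustration of how these three indicators are derived from raw request samples, here is a minimal Python sketch. The sample tuple layout and the function name `aggregate_per_second` are assumptions made for this example, and TPS is counted as all requests completed within a second (some tools count only successful ones).

```python
from collections import defaultdict

def aggregate_per_second(samples):
    """Group samples by whole second and compute the three core indicators.

    samples: iterable of (timestamp_seconds, success, rt_ms) tuples.
    Returns {second: {"tps": ..., "success_rate": ..., "avg_rt_ms": ...}}.
    """
    buckets = defaultdict(list)
    for ts, ok, rt_ms in samples:
        buckets[int(ts)].append((ok, rt_ms))

    result = {}
    for second, items in sorted(buckets.items()):
        total = len(items)
        succeeded = sum(1 for ok, _ in items if ok)
        result[second] = {
            "tps": total,                        # requests completed in this second
            "success_rate": succeeded / total,   # fraction between 0 and 1
            "avg_rt_ms": sum(rt for _, rt in items) / total,
        }
    return result

# Made-up samples: three fast successes, then one success and one slow failure.
samples = [
    (0.1, True, 20.0), (0.5, True, 30.0), (0.9, True, 40.0),
    (1.2, True, 50.0), (1.7, False, 450.0),
]
per_second = aggregate_per_second(samples)
```

Plotting these per-second values over the course of a test is what makes an inflection point visible, for example TPS flattening while response time climbs.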

For response time, judging by the average is misleading, because a system's response times are not evenly distributed. With a long-tail distribution, some users' requests can take exceptionally long while the overall average still meets expectations. Those users' experience is degraded, so the test should not be judged to have passed. For this reason, the 99th, 95th, and 90th percentiles are commonly used to decide whether response time meets the standard.
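A small worked example shows the effect. The sketch below uses made-up response times and a simple nearest-rank percentile; the helper name `percentile` and the data are assumptions for illustration, and real monitoring systems may use a different percentile definition.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up data: 97 fast requests and 3 very slow ones, in milliseconds.
response_times = [50.0] * 97 + [3000.0] * 3

mean_rt = sum(response_times) / len(response_times)
p90 = percentile(response_times, 90)
p95 = percentile(response_times, 95)
p99 = percentile(response_times, 99)
```

With this data the mean is 138.5 ms and both the 90th and 95th percentiles are 50 ms, which all look acceptable, while the 99th percentile is 3000 ms and exposes the long tail.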

In addition, to see how request response time breaks down, you can add indicators such as connect time and idle time.

Resource performance indicators

During a load test, it is also important to monitor system hardware, middleware, and database resources, including but not limited to:

• CPU usage
• Memory usage
• Disk throughput
• Network throughput
• Database connections
• Cache hit ratio

For details, see “Test Indicators” [1].

Load generator performance indicators

The performance of the load generator itself is easy to overlook. To make sure the load generator is not the bottleneck of the whole test pipeline, pay attention to the following indicators:

• Memory usage of the load generation process
• CPU usage of the load generator, plus the Load1 and Load5 load averages
• For JVM-based load engines, the number of garbage collections and garbage collection duration

Why use Prometheus for load test monitoring

Open source load testing tools such as JMeter support simple system performance metrics such as request success rate, system throughput, and response time. For large-scale distributed load testing, however, their native monitoring has the following shortcomings:

The monitoring indicators are not comprehensive. They generally cover only basic system performance indicators, enough to judge whether the load test passed. When a test fails and you need to troubleshoot, the native indicators fall short; for example, they cannot show the 99th percentile connect time of an API.
The timeliness of aggregation is not guaranteed
Aggregating monitoring data from a large-scale distributed test is not supported
Monitoring indicators cannot be reviewed retrospectively along a timeline

In summary, the native monitoring of open source load testing tools is not recommended for large-scale distributed load testing.

The following compares two open source monitoring options:

Option 1: Zabbix

Zabbix is an early open source distributed monitoring system that stores its data in a relational database such as MySQL or PostgreSQL.

For system performance monitoring, the load generators need to report indicators at one-second granularity. The resulting high-concurrency writes every second turn the relational database into the bottleneck of the monitoring system.

For resource performance monitoring, Zabbix has comprehensive indicators for physical and virtual machines, but its support for containers and elastic computing is insufficient.

Option 2: Prometheus

Prometheus uses a time series database for storage, which gives better read and write performance than a traditional relational database and holds up well when load generators report large volumes of per-second monitoring data.

For resource performance monitoring, Prometheus is better suited to cloud resources. Its monitoring of Kubernetes and containers is especially comprehensive, and it is easy to pick up for users of cloud native technology.

In summary, compared with Zabbix, Prometheus is better suited both to collecting and aggregating high-concurrency load test indicators and to monitoring cloud resources, and it is easier to extend.

Mature cloud products are also a good choice. The load testing tool PTS [2] and the observability tool ARMS [3] pair well: PTS provides the system performance metrics of the load test, while ARMS provides resource monitoring and overall observability, covering load test observability in one stop.

How to use Prometheus for load test monitoring

Adapting open source JMeter

Prometheus uses a pull model, so the load engine has to expose an HTTP service from which Prometheus can fetch the load test indicators.
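To make the pull model concrete, here is a minimal sketch in Python using only the standard library (JMeter itself is Java; Python is used here purely for illustration). It serves in-memory counters on a `/metrics` path in the Prometheus text exposition format. The metric names, the `METRICS` dictionary, and the function names are hypothetical, not part of JMeter or PTS.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters; a real engine would update them
# as requests complete.
METRICS = {"loadtest_requests_total": 0, "loadtest_failures_total": 0}

def render_metrics(metrics):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet

def start_metrics_server(port=0):
    """Serve /metrics on localhost in a background thread.
    Port 0 lets the operating system pick a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Prometheus would then be configured to scrape this address at a fixed interval; the engine never pushes data itself.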

JMeter provides a plug-in mechanism, so Prometheus monitoring can be added through a custom plug-in. The plug-in extends JMeter's BackendListener and, each time a sampler finishes executing, updates the load test metrics such as the number of successful requests, the number of failed requests, and request response time. The metrics are kept in memory and exposed through an HTTP service when Prometheus pulls data. The overall structure is as follows:

The custom JMeter plug-in needs the following pieces:

Add a metric registry
Implement a Prometheus metric updater
Implement a custom JMeter BackendListener that calls the Prometheus updater after each sampler executes
Implement an HTTP server, adding authentication logic if security requires it
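The first three pieces can be sketched as follows. This is a Python stand-in for illustration only: a real plug-in would be written in Java against JMeter's backend listener interface, and every class, method, and metric name below is hypothetical. The registry's `render` method produces the text that the HTTP server would return to Prometheus.

```python
import threading

class MetricRegistry:
    """Thread-safe in-memory store: result counters plus a response-time histogram."""

    def __init__(self, buckets_ms=(50, 100, 250, 500, 1000)):
        self._lock = threading.Lock()
        self.buckets_ms = tuple(buckets_ms)
        self.success = 0
        self.failure = 0
        # One slot per finite bucket, plus a final slot for +Inf.
        self.bucket_counts = [0] * (len(self.buckets_ms) + 1)
        self.rt_sum_ms = 0.0

    def observe(self, ok, rt_ms):
        with self._lock:
            if ok:
                self.success += 1
            else:
                self.failure += 1
            self.rt_sum_ms += rt_ms
            for i, upper in enumerate(self.buckets_ms):
                if rt_ms <= upper:
                    self.bucket_counts[i] += 1
                    break
            else:
                self.bucket_counts[-1] += 1  # slower than every finite bucket

    def render(self):
        """Prometheus text format; histogram buckets are cumulative."""
        with self._lock:
            lines = [
                "# TYPE loadtest_requests_total counter",
                f'loadtest_requests_total{{result="success"}} {self.success}',
                f'loadtest_requests_total{{result="failure"}} {self.failure}',
                "# TYPE loadtest_rt_ms histogram",
            ]
            cumulative = 0
            for upper, count in zip(self.buckets_ms, self.bucket_counts):
                cumulative += count
                lines.append(f'loadtest_rt_ms_bucket{{le="{upper}"}} {cumulative}')
            cumulative += self.bucket_counts[-1]
            lines.append(f'loadtest_rt_ms_bucket{{le="+Inf"}} {cumulative}')
            lines.append(f"loadtest_rt_ms_sum {self.rt_sum_ms}")
            lines.append(f"loadtest_rt_ms_count {cumulative}")
        return "\n".join(lines) + "\n"


class PrometheusUpdater:
    """Translates one finished sample into registry updates."""

    def __init__(self, registry):
        self.registry = registry

    def update(self, sample):
        self.registry.observe(sample["success"], sample["elapsed_ms"])


class SampleListener:
    """Stand-in for the custom backend listener: receives finished samples."""

    def __init__(self, updater):
        self.updater = updater

    def handle_sample_results(self, samples):
        for sample in samples:
            self.updater.update(sample)


registry = MetricRegistry()
listener = SampleListener(PrometheusUpdater(registry))
listener.handle_sample_results([
    {"success": True, "elapsed_ms": 40},
    {"success": True, "elapsed_ms": 120},
    {"success": False, "elapsed_ms": 1500},
])
output = registry.render()
```

Keeping the updater separate from the listener mirrors the list above: the listener only knows about samples, and the updater only knows about metrics, so either side can change without touching the other.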

PTS load testing tool

PTS (Performance Testing Service) is Alibaba Cloud's SaaS performance testing tool. It supports both its self-developed load engine and open source JMeter. To expose load test indicators to Prometheus from PTS, you do not need to develop a custom plug-in or modify the engine; three operations in the console are enough. The steps are as follows:

In the advanced settings of the PTS load test scenario, turn on the Prometheus switch
After the load test starts, copy the Prometheus configuration with one click from the monitoring export
Paste the configuration into your self-hosted Prometheus and hot-reload it; the configuration then takes effect

For details, see “How to Output Indicator Data from PTS Pressure Measurements to Prometheus” [4].
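For the hot-reload step, a self-hosted Prometheus can reload its configuration without a restart. One way is an HTTP POST to its `/-/reload` endpoint, which works only if Prometheus was started with the `--web.enable-lifecycle` flag; sending SIGHUP to the process is the alternative. The sketch below assumes Prometheus is reachable at `http://localhost:9090`, and the helper names are made up for this example.

```python
import urllib.request

def build_reload_request(prometheus_base_url):
    """Build the POST request that asks Prometheus to reload its configuration."""
    url = prometheus_base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")

def reload_prometheus(prometheus_base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(prometheus_base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

# Assumed address of the self-hosted Prometheus; nothing is sent here.
request = build_reload_request("http://localhost:9090/")
```

Calling `reload_prometheus("http://localhost:9090")` after pasting the copied configuration into the Prometheus configuration file would apply it.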

Quickly build a Grafana monitoring dashboard

PTS provides an official Grafana dashboard template [5] that can be imported with one click and then edited and extended flexibly to meet custom monitoring needs.

The dashboard shows the global request success rate, system throughput (TPS), 99th, 95th, and 90th percentile response times, and failed requests aggregated by error status code.
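How such percentiles are computed depends on how the metrics are exported, which this article does not specify. If response times are exported as a Prometheus histogram, percentiles are typically estimated from cumulative bucket counts by linear interpolation within the bucket that contains the target rank. The sketch below illustrates that idea in Python; the function name and the bucket data are made up, and it is not the PTS dashboard's actual query.

```python
def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with float("inf"). Assumes values are non-negative, so the first
    bucket starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            if upper == float("inf"):
                # No upper limit to interpolate towards: fall back to the
                # highest finite bound.
                return buckets[i - 1][0] if i > 0 else float("nan")
            in_bucket = cumulative - prev_count
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative
    return float("nan")

# Made-up cumulative counts for bucket bounds in milliseconds.
buckets = [(100, 90), (250, 96), (500, 99), (1000, 100), (float("inf"), 100)]
p90 = estimate_quantile(0.90, buckets)
p95 = estimate_quantile(0.95, buckets)
p99 = estimate_quantile(0.99, buckets)
```

Because the result is interpolated, its accuracy depends on how finely the bucket boundaries are chosen around the response times of interest.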

In the API distribution section, you can compare the monitoring indicators of each API side by side to quickly find the APIs that perform poorly.

In the API details section, you can view the detailed indicators of a single API to locate its performance bottleneck.

The dashboard also provides JVM garbage collection indicators for the load generator, which help determine whether the load generator itself is the bottleneck in the test pipeline.

The import procedure is as follows:

Step 1

On the menu bar, click Import under Dashboard:

Step 2

Enter the PTS Dashboard ID: 15981

Select your existing Prometheus data source, which in this example is named Prometheus, and then click Import.

Step 3

After the import, under [PTS Pressure Task] in the upper left corner, select the load test task you want to monitor to see the current monitoring dashboard.

The task name corresponds to the jobname in the Prometheus configuration from the PTS console's monitoring export.

Conclusion

This article covered:

What observability means in performance testing
Why Prometheus is used to monitor load test performance indicators
How to implement Prometheus-based load test monitoring with open source JMeter and with PTS on the cloud

Exporting PTS load test monitoring data to Prometheus is currently in free public beta, and you are welcome to try it.


Related links

[1] Test Indicators help.aliyun.com/document_de…

[2] PTS load testing tool www.aliyun.com/product/pts

[3] ARMS observability tool www.aliyun.com/product/arm…

[4] How to Output Indicator Data from PTS Pressure Measurements to Prometheus help.aliyun.com/document_de…

[5] Official Grafana dashboard template grafana.com/grafana/das…