Percentile monitoring in performance monitoring


Original link: www.adfpm.com/adf-perform…

One, foreword

What is the best metric to use in performance monitoring: the average or a percentile? Statistically speaking, there are many ways to determine how good the overall experience an application provides is. Averages are widely used. They are easy to understand and calculate, but they can be misleading. This article is about percentiles. I’ll explain what percentiles are and how you can use them to better understand application performance. Compared to the average, percentiles tell us how consistent the application’s response times are. Percentiles are well suited to trend analysis, SLA monitoring, and daily evaluation and troubleshooting of performance.

A service level agreement (SLA) is a formal commitment defined between a service provider and a customer. For Internet companies, an SLA typically guarantees the availability of a web service.

Two, how averages can be misleading

We can draw false conclusions from the average. For example, let’s assume that the average worker in a country earns around $2,000 a month (which doesn’t seem too bad). However, a closer look reveals that 9 out of 10 people in the country are migrant workers who only make about $1,000 a month. The remaining 1 in 10, the local residents, make around $11,000 a month (this is oversimplified, but you get the idea). If you do the math, (9 × 1,000 + 1 × 11,000) / 10, the average is indeed 2,000, but we can all see that this does not represent a realistic “typical” salary. The same applies to statistical monitoring of application performance and monitoring of SLAs. Very high values have a very big effect on the average, and in reality most applications have a few significant outliers that pull the average up.
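
The salary example can be checked with a few lines of Python. This is only a sketch: the list of ten salaries is an illustrative stand-in for the 9-in-10 and 1-in-10 groups described above, and the median is used as the 50th percentile.

```python
from statistics import mean, median

# Illustrative data for the example above: 9 out of 10 workers earn
# about $1,000 a month and 1 out of 10 earns about $11,000.
salaries = [1_000] * 9 + [11_000]

average_salary = mean(salaries)    # pulled upward by the single high earner
median_salary = median(salaries)   # 50th percentile: what a typical worker earns

print(f"average: {average_salary}, median: {median_salary}")
```

The average lands at 2,000 even though nobody in the sample earns that amount, while the median stays at the 1,000 that most workers actually earn.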

Three, percentile description

Understanding the concept of percentiles is useful when you want a high-level view of how your application is performing. A percentile is a statistical measure: it is the value below which a given percentage of a set of observations falls. For example, the response time that 90% of HTTP requests stay at or below is called the 90th percentile response time. In the screenshot below it is 3.0 seconds (so 90% of requests are processed in 3.0 seconds or less):

To get the 90th percentile response time for a single click action, sort all response time values for the requests that originate from that click in ascending order. Take the first 90 percent of this sorted set. The maximum response time in that subset is the 90th percentile response time for the click action.

Suppose there are 10 HTTP response time values available for a click action: 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10 seconds. After sorting, if I take the first 90% of the response time values as a single set, I get 1, 2, 3, 4, 5, 6, 7, 8, and 9. Here 9 is the maximum value, so 9 seconds is the 90th percentile response time for the click action.
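
The procedure above (sort, take the first 90 percent, return the maximum of that subset) corresponds to what is usually called the nearest-rank method. A minimal sketch follows; the function name `percentile_nearest_rank` is made up for this illustration, and monitoring tools or libraries that interpolate between neighboring values may report slightly different numbers.

```python
import math

def percentile_nearest_rank(values, pct):
    """Return the smallest observed value such that at least pct percent
    of the observations are less than or equal to it (nearest-rank method)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    # Size of the "first pct percent" subset, rounded up to a whole request.
    rank = math.ceil(pct * len(ordered) / 100)
    # The largest value in that subset is the percentile.
    return ordered[rank - 1]

response_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # seconds, from the example above
p90 = percentile_nearest_rank(response_times, 90)
print(f"90th percentile: {p90} seconds")
```

With the ten values from the example, the subset has nine elements and its maximum, 9 seconds, is returned.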

Of course, we want very fast response times for as many HTTP requests as possible; so, in an ideal world, the 50th, 95th, 99th, and even the 100th percentile response times would all be as low as possible.

Four, percentiles in performance monitoring

See the percentile chart for the June 2018 monthly overview (bottom right):

Average response times are shown in blue, and the 50th, 90th, and 95th percentiles are plotted in black, gray, and light gray:

The X-axis shows the days of June 2018, and the Y-axis shows the HTTP response time in seconds.

We can see the following pattern:

The 50th percentile response time is about 1 second (for a click on a web page). This means that 50% of HTTP requests are processed in 1 second or less.
The 90th percentile is about 2.75 seconds (90% of requests are processed in 2.75 seconds or less).
The 95th percentile is about 3.25 seconds (95% of requests are processed in 3.25 seconds or less).
The average response time is about 2.0 seconds (blue line). On Tuesdays (June 5, 12, 19, and 26) it peaks at about 2.5 seconds.
The average response time on weekends (about 1.6 seconds) is lower than on weekdays (about 2.0 seconds).
On Tuesdays, when the average response time peaks, the 50th, 90th, and 95th percentiles remain comparatively stable.
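
A per-day summary like the one in the chart could be computed roughly as follows. This is a sketch with invented sample data, not the monitor's actual implementation: day 4 stands for an ordinary day, and day 5 has the same requests plus two very slow ones. The helper `nearest_rank` is a hypothetical name.

```python
import math
from collections import defaultdict
from statistics import mean

def nearest_rank(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    return ordered[math.ceil(pct * len(ordered) / 100) - 1]

# Invented response times in seconds for an ordinary day (100 requests).
normal_day = [0.5] * 20 + [1.0] * 40 + [2.0] * 25 + [2.75] * 8 + [3.25] * 7

# (day of month, response time): day 5 adds two very slow requests.
samples = [(4, t) for t in normal_day] + [(5, t) for t in normal_day + [40.0, 60.0]]

by_day = defaultdict(list)
for day, seconds in samples:
    by_day[day].append(seconds)

summary = {
    day: {
        "avg": round(mean(times), 2),
        "p50": nearest_rank(times, 50),
        "p90": nearest_rank(times, 90),
        "p95": nearest_rank(times, 95),
    }
    for day, times in sorted(by_day.items())
}

for day, stats in summary.items():
    print(day, stats)
```

In this made-up data set the two slow requests raise the daily average by almost a second, while the 50th, 90th, and 95th percentiles do not move, which mirrors the pattern described in the list above.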

What does that tell us?

There may be some very slow requests (outliers) that have a big impact on the average. In this case, end users run a lot of very slow reports on Tuesdays. Tuesday is a kind of “reporting day,” and those reports disturb the average response time.
It all depends on our SLA and how well our application has to perform. If response times between 2.0 and 3.25 seconds are acceptable for your application or SLA, you’re probably doing just fine. In that case you don’t have to do much more than analyze the exceptionally slow requests (the 5 percent of HTTP requests that take more than 3.25 seconds) and determine whether you can speed them up.
If you need most HTTP requests to complete within 2.0 seconds, you have a lot of work to do to optimize your system, because many requests take longer than 2.0 seconds.
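
Whether a set of measurements satisfies an SLA of the form “at least X percent of requests within N seconds” can be checked directly. The sketch below uses invented response times and a hypothetical helper named `sla_report`; the thresholds are only examples.

```python
def sla_report(response_times, threshold_seconds, required_fraction):
    """Return (fraction of requests within the threshold, whether the target is met)."""
    if not response_times:
        raise ValueError("response_times must not be empty")
    within = sum(1 for t in response_times if t <= threshold_seconds)
    fraction = within / len(response_times)
    return fraction, fraction >= required_fraction

# Invented sample of 100 response times in seconds.
times = [0.5] * 20 + [1.0] * 40 + [2.0] * 25 + [2.75] * 8 + [3.25] * 5 + [8.0] * 2

# Lenient target: 95% of requests within 3.25 seconds.
lenient = sla_report(times, 3.25, 0.95)
# Strict target: 95% of requests within 2.0 seconds.
strict = sla_report(times, 2.0, 0.95)

print("lenient:", lenient)
print("strict:", strict)
```

With this sample the lenient target is met and the strict one is not, which is the situation the last two points describe: the same measurements can be fine or problematic depending on what the SLA demands.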

Five, monthly overview – active users and sessions

A chart of active end users and HTTP sessions is useful for estimating the number of active end users and sessions on one managed server or on all managed servers. We can compare these values with the other metrics in the performance monitoring graphs, such as JVM metrics, SLA metrics, and time spent in layers, and now we can also compare them with the percentiles:

The X-axis shows the days of June 2018, and the Y-axis shows the number of active sessions and end users:

We can see the following pattern:

Tuesday is the busiest day, with the most end users and sessions; we saw peaks on June 5, 12, 19, and 26, 2018.
On the busiest day (June 19), there were more than 80 unique active HTTP sessions and 70 unique end users.
There was very little end-user activity on weekends (approximately 10 unique end users and approximately 15 sessions).
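
Counting unique sessions and unique end users per day amounts to collecting distinct identifiers. The sketch below assumes a request log with a day, a session id, and a user name per entry; the record layout and all values are invented for illustration.

```python
from collections import defaultdict

# Invented request log entries: (day of month, HTTP session id, end user).
requests = [
    (19, "s1", "alice"),
    (19, "s2", "alice"),   # the same user with a second session
    (19, "s3", "bob"),
    (19, "s1", "alice"),   # repeated request in an existing session
    (23, "s4", "carol"),
]

sessions_by_day = defaultdict(set)
users_by_day = defaultdict(set)
for day, session_id, user in requests:
    sessions_by_day[day].add(session_id)
    users_by_day[day].add(user)

activity = {
    day: {"sessions": len(sessions_by_day[day]), "users": len(users_by_day[day])}
    for day in sorted(sessions_by_day)
}

for day, counts in activity.items():
    print(day, counts)
```

Because one end user can open several sessions on the same day, the session count can be higher than the user count, as on June 19 in the chart.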

Six, trend analysis

We can use percentiles in all kinds of performance evaluations, especially for regression and trend analysis after a new release. Did we really improve performance? Sometimes performance gets better or worse after a new release, and it is useful to be able to see and recognize this. If performance in production really improved, the 50th, 90th, and 95th percentile lines should go down after the release, which means faster response times:

As shown in the figure, a new version was released on June 17 with reportedly improved performance. In the remaining days of June, we saw response times drop at the 50th, 90th, and 95th percentiles, indicating that the new version did improve performance.
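
A before-and-after comparison of this kind can be sketched by computing the same percentiles over the two periods and looking at the differences. The data, the helper names `nearest_rank` and `percentile_deltas`, and the chosen percentiles are assumptions for illustration only.

```python
import math

def nearest_rank(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    return ordered[math.ceil(pct * len(ordered) / 100) - 1]

def percentile_deltas(before, after, pcts=(50, 90, 95)):
    """Map each percentile to (before, after, change); a negative change means faster."""
    result = {}
    for pct in pcts:
        b = nearest_rank(before, pct)
        a = nearest_rank(after, pct)
        result[pct] = (b, a, round(a - b, 3))
    return result

# Invented response times in seconds, 100 requests per period.
before_release = [0.5] * 20 + [1.0] * 40 + [2.0] * 25 + [2.75] * 8 + [3.25] * 7
after_release = [0.4] * 20 + [0.8] * 40 + [1.5] * 25 + [2.0] * 8 + [2.5] * 7

deltas = percentile_deltas(before_release, after_release)
improved = all(change < 0 for _, _, change in deltas.values())

for pct, (b, a, change) in deltas.items():
    print(f"p{pct}: {b} -> {a} ({change:+})")
print("improved at every percentile:", improved)
```

Requiring a drop at every tracked percentile is a deliberately strict choice for this sketch; in practice you may accept a release that improves the 90th and 95th percentiles even if the median stays flat.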

Seven, week, day, and hour overview

The end user/session and percentile overviews are also available by week, day, and hour, in the same manner as the monthly overview. Here’s an example of a day overview:

Eight, conclusion

Compared to the average, percentiles tell us how consistent the application’s response times are. They are useful for analyzing performance without being misled by unusually slow requests, for example when the average response time looks very high while most individual requests look normal. Percentiles are well suited to trend analysis, SLA monitoring, and daily performance assessment.
