Authors: Wenting, Buzhou

Introduction: This article focuses on best practices for using RocketMQ’s observability tools in online production environments. RocketMQ is an industry leader in observability. Its Dashboard and message trace features support core business links and effectively address capacity planning, message sending and receiving problems, and custom monitoring scenarios in large-scale production.

Introduction to Message Queues

Before getting to the main topic, let’s first look at Alibaba Cloud’s message queue products.

Alibaba Cloud provides a rich family of messaging products. The product matrix covers business scenarios in the Internet, big data, Internet of Things, and other fields, giving cloud customers a range of messaging solutions to choose from. Whatever the product, its core purpose is to help users make their businesses and systems asynchronous and decoupled, and to smooth out traffic peaks (peak shaving and valley filling). These products are also distributed, high-throughput, low-latency, and highly scalable.

However, each messaging product has a different focus in customer-facing business applications. Simply put, message queue RocketMQ is the preferred message channel for business scenarios; Kafka is the indispensable messaging product for big data; MQTT is the messaging solution for the Internet of Things; RabbitMQ focuses on traditional business messaging; and message queue MNS handles cloud-native product integration and event flow. Finally, EventBridge is the event hub on Alibaba Cloud, providing a unified event center.

This article focuses on the preferred message channel for business scenarios: message queue RocketMQ. Born out of Alibaba’s e-commerce system, RocketMQ offers high performance, low latency, and peak shaving and valley filling, and provides rich capabilities for handling instantaneous traffic peaks in business messaging scenarios. It is integrated into users’ core business links.

Because RocketMQ carries messages on core business links, it requires very high observability so that users can detect and locate abnormal fluctuations and troubleshoot specific business data problems in a timely manner. As a result, observability is becoming one of the core capabilities of message queue RocketMQ.

So what is observability? The following is a brief introduction.

Observability

When it comes to observability, the first three elements that come to mind are probably Metrics, Tracing, and Logging.

Applied to message queues, the three elements of observability can be refined as follows:

Metrics: Dashboard

1) Rich metric coverage: metrics include message volume, accumulation volume, and time spent at each stage, and each metric is aggregated and displayed along multiple dimensions: instance, Topic, and consumer GroupID;

2) Best-practice templates from the messaging team: users get recommended templates, which are continuously iterated on, with rich metrics that help locate problems quickly, especially in complex message consumption scenarios;

3) Prometheus + Grafana: metrics use the Prometheus standard data format and are displayed with Grafana; in addition to the templates, users can build their own custom dashboards.

Tracing: message trace

1) OpenTelemetry tracing standard: RocketMQ’s tracing standard has been incorporated into the OpenTelemetry open source standard, which provides a specification and rich definitions of messaging tracing scenarios;

2) Display customized for messaging: abstract request span data is reorganized along the message dimension to show one-to-many consumption and repeated consumption in an intuitive, easy-to-understand way;

3) Connects the upstream and downstream of a trace: a message trace can inherit the caller’s context and be added to the complete call chain, so the message trace links the upstream and downstream of an asynchronous call chain.
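To illustrate how a message can carry the caller’s context across the asynchronous boundary, the sketch below stores a trace context in the message properties on the producer side and restores it on the consumer side. It is not the RocketMQ client’s implementation: the dict standing in for message properties, the `traceparent` key, and the function names are assumptions. The value layout (`version-traceid-parentid-flags`) follows the W3C Trace Context header format.

```python
import re
import secrets

# version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def inject(properties: dict, trace_id: str, span_id: str) -> None:
    """Producer side: store the current trace context in the message properties."""
    properties["traceparent"] = f"00-{trace_id}-{span_id}-01"


def extract(properties: dict):
    """Consumer side: return (trace_id, parent_span_id), or None if absent or malformed."""
    match = TRACEPARENT.match(properties.get("traceparent", ""))
    return (match.group(1), match.group(2)) if match else None


def start_consume_span(properties: dict) -> dict:
    """Start a consume span that continues the producer's trace when possible."""
    context = extract(properties)
    trace_id, parent = context if context else (secrets.token_hex(16), None)
    return {"trace_id": trace_id, "span_id": new_span_id(), "parent_span_id": parent}
```

Because the consume span reuses the producer’s trace id and records the producer span as its parent, a tracing backend can join the two sides of the asynchronous hop into one call chain.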

Logging: client logging standardization

1) Standardized error codes: each kind of error has a unique error code;

2) Complete error messages: each message contains the full error description and the resource information needed for troubleshooting;

3) Standardized error levels: the log level of each error message is refined so that users can configure monitoring and alarms appropriately by level, such as Error and Warn.
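The three logging points above can be pictured with a small sketch in which every error carries a unique code, a fixed log level, and the resource context needed for troubleshooting. The error codes, names, and message layout are invented for illustration and do not correspond to real RocketMQ client error codes.

```python
import logging
from enum import Enum


class ClientError(Enum):
    # (code, level) pairs; every code here is made up for illustration.
    SEND_TIMEOUT = ("MQ-1001", logging.WARNING)
    TOPIC_NOT_FOUND = ("MQ-2001", logging.ERROR)

    @property
    def code(self) -> str:
        return self.value[0]

    @property
    def level(self) -> int:
        return self.value[1]


def format_error(error: ClientError, detail: str, **resources) -> str:
    """Build a message with the error code, description, and resource context."""
    context = " ".join(f"{key}={val}" for key, val in sorted(resources.items()))
    return f"[{error.code}] {detail} {context}".rstrip()


def log_error(logger: logging.Logger, error: ClientError, detail: str, **resources) -> None:
    """Log at the level assigned to this error so alarms can filter by level."""
    logger.log(error.level, format_error(error, detail, **resources))
```

With a fixed code-to-level mapping, an alarm rule can match on level alone (for example, everything at Error), while the code and resource fields narrow the search during troubleshooting.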

With the basic concepts of message queues and observability in place, let’s look at what happens when message queue RocketMQ meets observability.

Concepts behind RocketMQ’s observability tools

As the introduction above shows, RocketMQ’s observability can help users identify errors in message production and consumption based on error information. To make it easier to understand how this works in practice, let’s briefly introduce some concepts in the message production and consumption process.

Message production and consumption process concepts

First of all, let’s clarify the following concepts:

  • Topic: the message topic, a first-level message type used to classify messages.

  • Message: the carrier of information in a message queue.

  • Broker: the role that stores and forwards messages.

  • Producer: the message producer, also known as the message publisher, responsible for producing and sending messages.

  • Consumer: the message consumer, also known as the message subscriber, responsible for receiving and consuming messages.

Put simply, in message production and consumption, producers send messages to a MessageQueue of a Topic for storage, and consumers then consume the messages from that MessageQueue. With that in mind, what does the complete life cycle of a message look like, including when there are multiple consumers?

Here we take a timed message as an example. The Producer sends a message, which reaches the MQ server after a certain send latency, and MQ stores it in a MessageQueue; this is the message’s storage time in the queue. A timed message then has to wait for its scheduled delay before it can be consumed; the moment the delay expires is the message’s ready time. After that, the Consumer pulls the message from the MessageQueue, and the message reaches the Consumer client after some network latency. The message is not consumed immediately at that point: it first waits for a free thread on the Consumer, and actual business processing begins only once a thread is available.
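The life cycle of a timed message can be broken into per-stage durations. The sketch below assumes each stage boundary is available as an epoch-millisecond timestamp; the field names and the stage names are hypothetical and chosen only to mirror the description above.

```python
from dataclasses import dataclass


@dataclass
class MessageTimeline:
    # All fields are epoch milliseconds; the field names are hypothetical.
    sent_at: int           # producer starts sending
    stored_at: int         # broker has stored the message
    ready_at: int          # timer delay has expired, message can be consumed
    pulled_at: int         # message has arrived at the consumer client
    process_start_at: int  # a consumer thread picks the message up
    acked_at: int          # business processing finished, ack returned


def stage_durations(t: MessageTimeline) -> dict:
    """Split the end-to-end latency of one message into its stages."""
    return {
        "send_ms": t.stored_at - t.sent_at,
        "timer_delay_ms": t.ready_at - t.stored_at,
        "queue_wait_ms": t.pulled_at - t.ready_at,
        "thread_wait_ms": t.process_start_at - t.pulled_at,
        "processing_ms": t.acked_at - t.process_start_at,
    }
```

The stages add up to the full time between sending and ack, so a slow message can be attributed to one stage, for example a long thread wait versus slow business processing.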

As the description above shows, processing a business message takes some time, and the ack result is returned to the server only after processing completes. In the whole production and consumption process, consumption is the most complex part, and because of processing time and other factors, message accumulation is a common scenario. Let’s focus on what each metric means in a message accumulation scenario.

Message accumulation scenario

As shown in the figure above, the messages in the queue fall into three groups. Gray messages are completed: the consumer has processed them and returned an ack. Orange messages have been pulled to the consumer client and are being processed, but no processing result has been returned yet; the key metric for these messages is the message processing time. Green messages have been stored in the MQ queue and are available for consumers to consume; these are known as ready messages.

Ready messages

Meaning: the number of messages in the ready state.

Purpose: this number reflects how many messages have not yet been consumed. When a consumer becomes abnormal, the number of ready messages increases.

Queue time

Meaning: the difference between the current time and the ready time of the earliest ready message.

Purpose: this value reflects how long unprocessed messages have been delayed, which is an important metric for time-sensitive businesses.
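Both definitions reduce to simple arithmetic over the ready timestamps of the messages that have not yet been pulled. A minimal sketch, assuming those timestamps are available as epoch milliseconds (the function and key names are illustrative):

```python
def accumulation_metrics(ready_times_ms, now_ms):
    """Compute ready-message count and queue time.

    ready_times_ms: ready timestamps of messages still waiting to be consumed.
    now_ms: the current time, in the same unit.
    """
    if not ready_times_ms:
        # Nothing is waiting, so there is no queueing delay.
        return {"ready_messages": 0, "queue_time_ms": 0}
    return {
        "ready_messages": len(ready_times_ms),
        # Queue time is measured from the earliest ready message.
        "queue_time_ms": now_ms - min(ready_times_ms),
    }
```

For example, with ready timestamps of 1000, 4000, and 2500 and a current time of 10000, there are 3 ready messages and the queue time is 9000 ms, driven entirely by the oldest message.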

Features of RocketMQ’s observability tool

Building on the concepts introduced above, the following sections describe the two core features of RocketMQ’s observability tools in detail.

Observability feature: Dashboard

Dashboard lets you view specified metric data filtered by various parameters. The main metric data falls into the following three groups:

1) Overview:

  • View the total number of messages sent and received, TPS, and the distribution of message types for the instance.
  • View the current distribution and ranking of each metric: the Topics with the most messages sent, and the GroupIDs with the most messages consumed, the most accumulated messages, the longest queue time, and so on.

2) Topic (message sending):

  • View the sent message volume graph for a specified Topic.
  • View the send success rate graph for a specified Topic.
  • View the send time graph for a specified Topic.

3) GroupID (message consumption):

  • View the message volume graph for a specified Group subscribed to a specified Topic.
  • View the consumption success rate for a specified Group subscribed to a specified Topic.
  • View metrics such as the consumption time for a specified Group and a specified Topic.
  • View message accumulation metrics for a specified Group subscribed to a specified Topic.

Observability feature: Message trace

Tracing provides the message trace feature, which offers the following three capabilities:

1) Convenient queries: traces can be queried by basic message information. In the second phase, queries can also be filtered by result status and elapsed time, so that relevant traces can be isolated and problems located quickly.

2) Detailed trace information: in addition to timestamps and elapsed time for each stage of the life cycle, traces include the account and machine information of producers and consumers.

3) Optimized display: traces are displayed appropriately for different message types, for scenarios with multiple consumer GroupIDs, for scenarios where the same consumer GroupID receives multiple redeliveries, and so on.

Best practices

Scenario 1: Troubleshooting

1) Objective: the health of message production and consumption

2) Principles

  • Level-1 metrics: metrics used for alarms; widely accepted metrics that nobody disputes.

  • Level-2 metrics: when a level-1 metric changes, check the level-2 metrics to quickly locate the cause.

  • Level-3 metrics: metrics that explain fluctuations in level-2 metrics. Add these according to your own business characteristics and experience.

Based on this objective and these principles, producer and consumer users can identify and analyze problems as follows:

Scenario 2: Capacity planning

In capacity planning scenarios, you need to solve the following three problems:

1) Question 1: How to evaluate instance capacity?

Solutions:

  • On the Instance Details page, you can view the peak TPS of messages sent and received within the selected period.

  • For Platinum Edition instances, you can use this data to add alarm monitoring and to evaluate business demand.

2) Question 2: How to check the usage of Standard Edition instances?

Solutions:

  • View the total message volume module on the Overview page.

3) Question 3: Which offline resources need to be cleaned up?

Solutions:

  • For a specified period (for example, the last week), sort Topics by the number of messages sent in ascending order and check whether any Topic has a sent volume of 0. The services related to these Topics may have gone offline.

  • For a specified period (for example, the last week), sort GroupIDs by the number of messages consumed in ascending order and check whether any GroupID has a consumed volume of 0. The services related to these GroupIDs may have gone offline.
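The checks for Question 1 and Question 3 could be scripted along the following lines, assuming the Dashboard data has been exported as per-second message counts and as per-Topic (or per-GroupID) totals for the chosen period. The function names, the `tps_limit` parameter (standing in for the TPS capacity of an instance), and the 70% alarm ratio are illustrative assumptions, not product settings.

```python
def capacity_check(per_second_counts, tps_limit, alarm_ratio=0.7):
    """Compare the peak TPS in the selected period with an assumed TPS limit."""
    if tps_limit <= 0:
        raise ValueError("tps_limit must be positive")
    peak = max(per_second_counts, default=0)
    utilization = peak / tps_limit
    return {
        "peak_tps": peak,
        "utilization": utilization,
        "alarm": utilization >= alarm_ratio,
    }


def idle_resources(volume_by_name):
    """Sort Topics (or GroupIDs) by volume, ascending, and list those at zero."""
    ranked = sorted(volume_by_name.items(), key=lambda item: (item[1], item[0]))
    idle = [name for name, volume in ranked if volume == 0]
    return ranked, idle


if __name__ == "__main__":
    print(capacity_check([120, 800, 450], tps_limit=1000))
    print(idle_resources({"orders": 5200, "legacy_sync": 0, "refunds": 40})[1])
```

A zero volume only suggests that a service may be offline; resources that are used rarely (for example, monthly jobs) need a longer observation window before they are cleaned up.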

Scenario 3: Service Planning

The following problems are solved in service planning scenarios:

1) Question 1: How to view the distribution of business peaks?

Solutions:

  • View the time of day at which the number of messages received by a Topic peaks.
  • View the difference in the number of messages received by a Topic between weekends and weekdays.
  • View how the number of messages received by a Topic changes during holidays.

2) Question 2: How to tell which businesses are currently trending upward?

Solutions:

  • View the message volume over time to determine the traffic trend of each service.

3) Question 3: How to optimize the performance of the consumer system?

Solutions:

  • Look at the message processing time to determine whether it is within a reasonable range and whether there is room for improvement.
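A minimal sketch of these analyses, assuming hourly message counts for a Topic and per-message processing times have been exported from the Dashboard. The function names, the input shapes, and the nearest-rank percentile method are assumptions made for illustration.

```python
import math
from collections import defaultdict
from datetime import datetime
from statistics import mean


def peak_hour(hourly):
    """hourly: non-empty list of (datetime, message_count), one entry per hour.

    Returns the hour of day (0-23) with the highest total volume.
    """
    totals = defaultdict(int)
    for ts, count in hourly:
        totals[ts.hour] += count
    return max(totals, key=totals.get)


def weekday_weekend_daily_avg(hourly):
    """Return (average daily volume on weekdays, average daily volume on weekends)."""
    per_day = defaultdict(int)
    for ts, count in hourly:
        per_day[ts.date()] += count
    weekday = [v for day, v in per_day.items() if day.weekday() < 5]
    weekend = [v for day, v in per_day.items() if day.weekday() >= 5]
    return (mean(weekday) if weekday else 0, mean(weekend) if weekend else 0)


def percentile(values, p):
    """Nearest-rank percentile, e.g. p=99 for the P99 processing time."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Comparing a high percentile of processing time with the median is one way to judge room for improvement: a P99 far above the median points to occasional slow messages rather than uniformly slow processing.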

This article introduced message queues, observability, the concepts and features of RocketMQ’s observability tools, and best practices, showing how these tools provide visibility into core business links. We hope it helps you troubleshoot and maintain your online systems.
