Brief introduction: As the company's business continues to grow, so does its traffic. We found that major production incidents are often triggered by sudden traffic surges. Controlling and protecting traffic is therefore essential to keeping the system highly available.

Author | Liang Yong


Hello has evolved into a comprehensive mobile travel platform covering two-wheeled travel (Hello Bike, Hello Moped, Hello Electric Vehicle, Hello Battery Swap) and four-wheeled travel (Hello Hitch, ride-hailing, Hello Taxi), and has branched out into local-life services such as hotels and in-store group buying. Traffic has grown along with the business, and sudden surges have been behind several major production incidents, so traffic control and protection are central to high availability. This article shares Hello's experience in governing message traffic and microservice invocation.

About the author

Liang Yong (Lao Liang) is a co-author of the column RocketMQ Practice and Advancement and took part in reviewing the book RocketMQ Technology Insider. He has been a speaker at the ArchSummit Global Architect Conference and the QCon Case Study Club.

He currently focuses on back-end middleware. His WeChat official account [Guonong Laoliang] has published more than a hundred hands-on source-code articles covering the RocketMQ, Kafka, gRPC, Nacos, Sentinel, and Java NIO series. He works as a senior technical expert at Hello Chuxing.

Talk about governance

Before we start, let's talk about governance itself. Here is Lao Liang's personal understanding:

What is governance for?

  • Making our environment better

How do we know what is not good enough?

  • Past experience
  • User feedback
  • Comparison

How do we know whether it stays good?

  • Monitoring and tracking
  • Alert notifications

How do we make it better when it goes bad?

  • Governance measures
  • Emergency plans


This article covers three topics:

  1. Building a distributed message governance platform
  2. RocketMQ pitfalls in practice and their solutions
  3. Building a high-availability governance platform for microservices


Running RabbitMQ unprotected

The company previously used RabbitMQ, and the following were the pain points. Many of the incidents were caused by RabbitMQ cluster flow control.

  • Should an excessive backlog be cleared or not? That is the question. "Let me think about it some more."
  • An excessive backlog triggers cluster flow control? That really hurts the business.
  • Want to re-consume the last two days of data? "Please send it again."
  • Want to know which services are connected? "Wait a while, I have to dig out the IPs first."
  • Are there usage risks such as large messages? "Let me guess."

Services running unprotected

There was once an incident in which multiple businesses shared the same database. During one evening rush hour, traffic spiked and the database was overwhelmed.

  • Upgrading the single database instance to the highest specification still did not solve the problem
  • After a restart it recovered briefly, then was soon overwhelmed again
  • The cycle repeated, and we could only suffer and wait silently for the peak to pass

Reflection: both messages and services need sound governance measures

Build a distributed message governance platform

Design guidelines

Deciding which metrics are key and which are secondary is the primary issue of message governance.

The design goal is to hide the complexity of the underlying middleware (RocketMQ/Kafka) and to route messages dynamically through a unique identifier. On top of that, we build a message governance platform that integrates resource management and control, retrieval, monitoring, alerting, inspection, disaster recovery, and visual operations, to keep the messaging middleware running stably and healthily.

Points to consider when designing a message governance platform

  • Provide an easy-to-use API
  • What are the key points for judging whether clients are being used without safety risks?
  • What are the key indicators of cluster health?
  • Which common user and operations tasks should be visualized?
  • What measures are available to deal with these unhealthy states?

Make it as simple and easy to use as possible

Design guidelines

Making complex problems simple is a real skill.

Minimalist unified API

A unified SDK encapsulates both kinds of messaging middleware (Kafka/RocketMQ).

Apply once

Automatic creation of topics and consumer groups is not suitable for production: it can get out of control and undermines full-lifecycle management and cluster stability. The application process needs to be controlled but kept as simple as possible. For example, a single application takes effect in every environment and generates the associated alert rules.

Client Governance

Design guidelines

Monitor client usage to find appropriate governance measures

Scenario review

Scenario 1 Instantaneous traffic and cluster flow control

Suppose the cluster currently handles 10,000 TPS and traffic instantly jumps to 20,000 or more. Such a steep surge is very likely to trigger cluster flow control. For this scenario, the client's sending rate needs to be monitored; once the rate and steepness thresholds are met, sending is smoothed out.
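The smoothing idea can be sketched roughly as follows. This is a minimal illustration in Python, not the SDK's actual implementation: the class `SendSmoother`, its parameters, and the "grow by at most a fixed factor per second" policy are all assumptions made for the example.

```python
class SendSmoother:
    """Caps how fast a client's per-second send rate may grow.

    Hypothetical helper, not part of any real SDK. When the requested
    rate reaches `rate_threshold` and exceeds `steep_factor` times the
    rate allowed in the previous second, growth is capped so the rate
    ramps up over several seconds instead of jumping at once.
    """

    def __init__(self, rate_threshold=10_000, steep_factor=2, floor=5_000):
        self.rate_threshold = rate_threshold
        self.steep_factor = steep_factor
        self.floor = floor          # cap used when the previous rate was near zero
        self.prev_allowed = 0

    def allowed_rate(self, requested):
        cap = max(self.prev_allowed * self.steep_factor, self.floor)
        if requested >= self.rate_threshold and requested > cap:
            allowed = cap           # steep spike: let the rate grow gradually
        else:
            allowed = requested     # normal traffic passes unchanged
        self.prev_allowed = allowed
        return allowed


smoother = SendSmoother()
# One quiet second followed by a sustained jump to 20,000 messages/s.
rates = [smoother.allowed_rate(r) for r in (4_000, 20_000, 20_000, 20_000)]
print(rates)  # [4000, 8000, 16000, 20000]
```

In this sketch the messages that exceed the allowed rate would simply be sent a little later, which is the "sending becomes a little slower" behavior described above.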

Scenario 2 Large messages and cluster jitter

When clients send large messages, such as hundreds of kilobytes or even several megabytes, IO can take a long time and cause cluster jitter. Governing this scenario requires monitoring the size of sent messages. We use after-the-fact inspection to identify services that send large messages and push their owners to compress or refactor them, keeping messages within 10 KB.

Scenario 3 Client version too low

As features iterate, the SDK version is upgraded, and changes may introduce risks beyond the new features. Running too old a version means, first, that some features are unsupported and, second, that there may be security risks. To track SDK usage, clients report their SDK version, and inspection is used to push users to upgrade.

Scenario 4 Consumption traffic removal and recovery

Consumption traffic removal and recovery typically serve two purposes: removing traffic from an application before a release, and removing traffic before troubleshooting a problem. To support this, the client needs to listen for removal/recovery events and pause or resume consumption accordingly.
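A minimal sketch of the listening side, in Python for brevity. The class `PausableConsumer` and the event names `"REMOVE"` and `"RECOVER"` are hypothetical; a real client would wire the same idea to whatever pause/resume mechanism its messaging client provides.

```python
import threading


class PausableConsumer:
    """Wraps a message handler so that consumption can be paused and resumed."""

    def __init__(self, handler):
        self._handler = handler
        self._enabled = threading.Event()
        self._enabled.set()               # consumption is on by default

    def on_event(self, event):
        # Hypothetical event names pushed by the governance platform.
        if event == "REMOVE":
            self._enabled.clear()         # stop taking traffic
        elif event == "RECOVER":
            self._enabled.set()           # take traffic again

    def poll_once(self, messages):
        """Handles a batch only while enabled; returns how many were handled."""
        if not self._enabled.is_set():
            return 0
        for message in messages:
            self._handler(message)
        return len(messages)


handled = []
consumer = PausableConsumer(handled.append)
consumer.poll_once(["m1", "m2"])
consumer.on_event("REMOVE")
skipped = consumer.poll_once(["m3"])      # ignored while traffic is removed
consumer.on_event("RECOVER")
consumer.poll_once(["m3"])
```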

Scenario 5 Send/consume time detection

How long does it take to send or consume a message? By monitoring these times and using inspection to identify poorly performing applications, we can push for targeted improvements.

Scenario 6 Improve the efficiency of troubleshooting

Troubleshooting often requires retrieving information about a message's life cycle, such as what was sent, where it is stored, and when it was consumed. The life cycle of a single message can be linked together through its MSGID. In addition, by embedding a link identifier such as RPCID/TRACEID in the message header, the messages belonging to one request can be linked together.
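As a rough illustration of the two lookups described here, the sketch below keeps an in-memory index keyed by message ID and by trace ID. The class `MessageTraceIndex`, the stage names, and the in-memory storage are assumptions for the example; a real platform would store these records in a search or trace system.

```python
from collections import defaultdict


class MessageTraceIndex:
    """Hypothetical in-memory index of message life-cycle events."""

    def __init__(self):
        self._by_msg = defaultdict(list)    # msg_id -> [(timestamp_ms, stage)]
        self._by_trace = defaultdict(set)   # trace_id -> {msg_id}

    def record(self, msg_id, stage, timestamp_ms, trace_id=None):
        self._by_msg[msg_id].append((timestamp_ms, stage))
        if trace_id is not None:
            self._by_trace[trace_id].add(msg_id)

    def lifecycle(self, msg_id):
        """Stages of one message in time order."""
        return [stage for _, stage in sorted(self._by_msg.get(msg_id, []))]

    def messages_in_trace(self, trace_id):
        """All message IDs that carried the given trace ID in their header."""
        return sorted(self._by_trace.get(trace_id, set()))


index = MessageTraceIndex()
index.record("msg-1", "consumed", 30, trace_id="trace-a")
index.record("msg-1", "sent", 10, trace_id="trace-a")
index.record("msg-1", "stored", 20, trace_id="trace-a")
index.record("msg-2", "sent", 15, trace_id="trace-a")
print(index.lifecycle("msg-1"))
print(index.messages_in_trace("trace-a"))
```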

Summary of governance measures

Monitoring information required

  • Sending/consuming speed
  • Time to send/consume
  • Message size
  • Node information
  • Link ID
  • Version information

Common governance measures

  • Regular inspection: With the instrumentation data, risky applications can be found through inspection, for example send/consume time above 800 ms, message size above 10 KB, or a version older than a specified version.
  • Send smoothing: For example, when instantaneous traffic reaches 10,000 and rises more than 2x, warm-up can be used to smooth out the spike.
  • Consumption rate limiting: When a third-party interface needs rate limiting, consumption can be throttled. This can be implemented together with the high-availability framework.
  • Consumption removal: The consuming client is paused and resumed by listening for removal events.
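The inspection rule can be sketched as a simple check over reported client data. The 800 ms and 10 KB thresholds come from the text; the report field names, the function names, and the minimum version `2.0.0` are hypothetical.

```python
def parse_version(text):
    """Turns '1.10.2' into (1, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def inspect_client(report, min_version="2.0.0",
                   max_cost_ms=800, max_size_bytes=10 * 1024):
    """Returns the list of risks found in one client's reported metrics."""
    risks = []
    if report["cost_ms"] > max_cost_ms:
        risks.append("slow send/consume")
    if report["max_message_bytes"] > max_size_bytes:
        risks.append("large message")
    if parse_version(report["sdk_version"]) < parse_version(min_version):
        risks.append("outdated SDK")
    return risks


report = {"cost_ms": 950, "max_message_bytes": 4096, "sdk_version": "1.10.2"}
print(inspect_client(report))  # ['slow send/consume', 'outdated SDK']
```

Comparing versions as integer tuples avoids the string-comparison trap in which "1.10.2" would sort before "1.9.0".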

Topic/consumer group governance

Design guidelines

Monitor topic and consumer group resource usage

Scenario review

Scenario 1 Consumption backlog

Some business scenarios are sensitive to consumption backlog, while others are not, as long as consumption eventually catches up. For example, unlocking a bike must complete within seconds, whereas batch scenarios such as information aggregation are insensitive to backlog. By collecting backlog metrics, we send real-time alerts to the owners of applications that reach the threshold, so they always know the state of consumption.

Scenario 2 Impact of send/consume speed

Should an alarm fire when send/consume speed drops to zero? In some scenarios the speed must never fall to zero, because that means the business is abnormal. By collecting speed metrics, we send real-time alerts for applications that reach the threshold.

Scenario 3 Consumer node disconnection

The application owner needs to be notified when a consumer node drops offline. This requires collecting information on registered nodes, so that a disconnection triggers a real-time alert.

Scenario 4 Send/consume imbalance

Unbalanced sending or consumption often hurts performance. I remember one consultation in which a colleague had set the message key to a constant. By default the partition is chosen by hashing the key, so every message went to a single partition, and performance could never improve. In addition, the consumption backlog of each partition is monitored, and a real-time alert is triggered when the imbalance becomes excessive.
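The constant-key problem is easy to reproduce in a small sketch. Here `zlib.crc32` stands in for the client's real key hash (an assumption made to keep the example deterministic), and the function names and the max-to-mean imbalance measure are hypothetical.

```python
import zlib
from collections import Counter


def choose_partition(key, num_partitions):
    """Picks a partition by hashing the key (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_counts(keys, num_partitions):
    counts = Counter(choose_partition(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def imbalance(counts):
    """Ratio of the busiest partition to the mean; 1.0 is perfectly even."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


# A constant key sends every message to the same partition.
skewed = partition_counts(["constant-key"] * 8000, 8)
# Distinct keys spread messages across partitions.
spread = partition_counts([f"order-{i}" for i in range(8000)], 8)

print(imbalance(skewed))   # 8.0: one partition carries all the traffic
print(imbalance(spread) < imbalance(skewed))
```

An alert rule could then fire when the imbalance ratio of a topic's send counts or per-partition backlog exceeds a chosen threshold.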

Summary of governance measures

Monitoring information required

  • Sending/consuming speed
  • Send details per partition
  • Consumption backlog per partition
  • Consumer group backlog
  • Registered node information

Common governance measures

  • Real-time alerts: real-time alerts for consumption backlog, send/consume speed, node disconnection, and partition imbalance.
  • Performance improvement: when consumption cannot keep up with the backlog, add pull threads and consumer threads, increase the number of partitions, or take similar measures.
  • Self-service troubleshooting: multi-dimensional retrieval tools are provided, such as retrieving a message's life cycle by time range, by MSGID, or through the tracing system.

Cluster health governance

Design guidelines

What are the core metrics for measuring cluster health?

Scenario review

Scenario 1 Cluster health check

A cluster health check answers the question of whether the cluster is healthy. It works by checking the number of nodes in the cluster, the heartbeat of each node, the cluster write TPS water level, and the cluster consumption TPS water level.

Scenario 2 Cluster stability

Cluster flow control often indicates insufficient cluster performance, and cluster jitter can also cause client send timeouts. The stability of the cluster can be tracked by collecting the heartbeat time of each node and the rate of change of the cluster write TPS water level.

Scenario 3 High availability of the cluster

High availability targets extreme scenarios in which an availability zone becomes unavailable, or in which some topic or consumer group on the cluster becomes abnormal and requires targeted action. For example, MQ can address this through cross-deployment of masters and slaves across availability zones in the same city, dynamic migration of topics and consumer groups to a disaster recovery cluster, and multi-site active-active.

Summary of governance measures

Monitoring information required

  • Number of cluster nodes
  • Heartbeat time of each cluster node
  • Cluster write TPS water level
  • Cluster consumption TPS water level
  • Rate of change of cluster write TPS

Common governance measures

  • Regular inspection: regular inspection of the cluster TPS water level and hardware water level.
  • Disaster recovery measures: cross-deployment of masters and slaves across availability zones in the same city, dynamic migration to a disaster recovery cluster, and multi-site active-active.
  • Cluster tuning: OS version/parameter tuning and cluster parameter tuning.
  • Cluster classification: split by line of business and by core/non-core services.

Focusing on the most important indicator

If I had to pick the most important of these key indicators, I would choose the heartbeat detection of each node in the cluster, that is, response time (RT). Let's look at the possible factors that affect RT.

About alerts

  • Most monitoring metrics are checked at second-level granularity
  • Alerts that hit the threshold are pushed to the company's unified alerting system for real-time notification
  • Risk notices from inspection are pushed to the company's inspection system and summarized weekly

Message Platform Diagram

Architecture diagram

Dashboard illustration

  • Multidimensional: cluster dimension, application dimension
  • Full aggregation: full aggregation of key indicators

RocketMQ pitfalls in practice and their solutions


We always run into pitfalls, and when we do, we fill them in.

1. RocketMQ cluster CPU spikes

Problem description


RocketMQ slave and master nodes frequently ran at high CPU with obvious spikes, and slave nodes crashed outright many times.

Only the system log contained error messages:

    2020-03-16T17:56:07.505715+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
    2020-03-16T17:56:07.505717+08:00 VECS0xxxx kernel: java: order:0, mode:0x20
    2020-03-16T17:56:07.505719+08:00 VECS0xxxx kernel: Pid: 12845, comm: java Not tainted 2.6.32-754.17.1.el6.x86_64 #1
    2020-03-16T17:56:07.505721+08:00 VECS0xxxx kernel: Call Trace:
    2020-03-16T17:56:07.505724+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
    2020-03-16T17:56:07.505726+08:00 VECS0xxxx kernel: [] ? dev_queue_xmit+0xd0/0x360
    2020-03-16T17:56:07.505729+08:00 VECS0xxxx kernel: [] ? ip_finish_output+0x192/0x380
    2020-03-16T17:56:07.505732+08:00 VECS0xxxx kernel: [] ?

Tuning various system parameters only reduced the spikes without eliminating them; they still exceeded 50%.

The solution

Upgrade all cluster hosts from CentOS 6 to CentOS 7, which takes the kernel from 2.6 to 3.10. The CPU spikes disappeared.

2. Delayed messages fail on a production RocketMQ cluster

Problem description

RocketMQ Community Edition supports 18 delay levels by default, and a message at each level is delivered to the consumer at the set time. We had specifically tested whether the delivery interval was accurate, and the results showed it to be very accurate. Yet this accurate feature unexpectedly failed: a business colleague reported that delayed messages on one production cluster could not be consumed. Strange!

The solution

Move "delayOffset.json" and "consumequeue/SCHEDULE_TOPIC_XXXX" to another directory, which is equivalent to deleting them, then restart the Broker nodes one by one. After the restart, verification showed that delayed messages were sent and consumed normally.
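For reference, the 18 default levels can be expanded into seconds with a few lines of code. The level string below is the default value of the broker's `messageDelayLevel` setting; the parsing helper itself is a hypothetical sketch, not RocketMQ code.

```python
# Default value of the broker setting messageDelayLevel (18 levels).
DEFAULT_DELAY_LEVELS = "1s 5s 10s 30s 1m 2m 3m 4m 5m 6m 7m 8m 9m 10m 20m 30m 1h 2h"

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_delay_levels(spec=DEFAULT_DELAY_LEVELS):
    """Maps each 1-based delay level to its delay in seconds."""
    return {
        level: int(item[:-1]) * UNIT_SECONDS[item[-1]]
        for level, item in enumerate(spec.split(), start=1)
    }


levels = parse_delay_levels()
print(len(levels))   # 18
print(levels[3])     # 10   (level 3 delays a message by 10 seconds)
print(levels[18])    # 7200 (level 18 delays a message by 2 hours)
```

A producer selects one of these levels per message rather than an arbitrary delay, which is why the levels are tested as a fixed table.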

Build a high-availability governance platform for microservices

Design guidelines

Deciding which services are core and which are non-core is the primary issue of service governance.

Design goals

Services must be able to cope with sudden, steep traffic increases, and core services in particular must keep running smoothly.

Application grading and grouped deployment

Application grading

Applications are divided into four levels along two dimensions, user impact and business impact.

  • Business impact: the scope of business affected when the application fails
  • User impact: the number of users affected when the application fails

  • S1: Core products whose failure makes the service unusable for external users or causes large financial loss, such as the core links of the main businesses (locking and unlocking of bikes and mopeds, posting and accepting hitch-ride orders) and the applications those core links strongly depend on.
  • S2: Does not directly affect transactions, but concerns the management and maintenance of important front-office configuration, or back-office business processing functions.
  • S3: A failure has very little impact on users or core product logic and no impact on the main business, or affects only a small amount of new business. This level also covers important internal tools that do not directly affect the business but whose management functions have a small impact on front-office business.
  • S4: Systems for internal users that do not directly affect the business, or that are due to be taken offline.

Grouped deployment

S1 services are the company's core services and the key objects of protection. They must be shielded from unexpected bursts of non-core traffic.

  • S1 services are deployed in groups, split into two environments: Stable and Standalone
  • Traffic from non-core services calling S1 services is routed to the Standalone environment
  • Calls from S1 services to non-core services must be configured with a circuit-breaker policy
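The routing rule and the circuit-breaker policy can be sketched as below. This is an illustration under stated assumptions, not the platform's implementation: routing core callers to Stable is inferred from the grouping described above, and the `CircuitBreaker` class with its failure-count policy is a simplified, hypothetical example.

```python
import time

CORE_LEVELS = {"S1"}


def target_environment(caller_level):
    """Environment of an S1 service that should receive the caller's traffic."""
    return "stable" if caller_level in CORE_LEVELS else "standalone"


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, retries after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=10.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()                  # open: fail fast
            self.opened_at = None                  # half-open: allow one trial call
            self.failures = self.max_failures - 1  # one more failure reopens it
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


print(target_environment("S3"))  # standalone
```

An S1 service would wrap each call to a non-core dependency in `breaker.call(remote_call, fallback)`, so that a failing non-core service degrades to the fallback instead of tying up core threads.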

Building multiple rate-limiting and circuit-breaking capabilities

The high-availability platform capabilities we have built

Some rate-limiting effects


  • Warm-up illustration

  • Queuing

  • Warm-up + queuing

High-availability platform illustration


  • All middleware is integrated
  • Dynamic configuration takes effect in real time
  • Detailed traffic for each resource and IP node
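The queuing effect can be approximated with a pacing limiter that spaces requests evenly and rejects those that would wait too long. This is a minimal sketch, not the platform's implementation: the class `PacingLimiter` and its parameters are hypothetical, and warm-up is not covered here.

```python
class PacingLimiter:
    """Spaces requests `1 / permits_per_second` apart, queuing up to `max_wait_seconds`."""

    def __init__(self, permits_per_second, max_wait_seconds):
        self.interval = 1.0 / permits_per_second
        self.max_wait = max_wait_seconds
        self.next_free = 0.0      # earliest time the next request may pass

    def try_acquire(self, now):
        """Returns the seconds to wait before passing, or None if rejected."""
        start = max(now, self.next_free)
        wait = start - now
        if wait > self.max_wait:
            return None           # queue is too long: reject
        self.next_free = start + self.interval
        return wait


limiter = PacingLimiter(permits_per_second=4, max_wait_seconds=0.5)
# Four requests arrive at the same instant (t = 0).
waits = [limiter.try_acquire(now=0.0) for _ in range(4)]
print(waits)  # [0.0, 0.25, 0.5, None]
```

The caller passes the current time explicitly, which keeps the sketch deterministic; a real limiter would read a monotonic clock and sleep for the returned wait.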


Summary

  • Deciding which metrics are key and which are secondary is the primary issue of message governance
  • Deciding which services are core and which are non-core is the primary issue of service governance
  • Reading source code combined with hands-on practice is a better way to work and learn
