The author | LingChu Alibaba development engineer

Takeaway: Apache RocketMQ Network Filter support since the end of 19, the Code Review (Pull Request) lasted four months, RocketMQ has been accepted into the CNCF Envoy’s official community this month, making RocketMQ the second middleware product in The country to successfully enter the Service Mesh community after Dubbo.

Sending and receiving messages in the Service Mesh

The main process is as follows:

Briefly describe the process of sending and consuming RocketMQ messages under Service Mesh:

  • Pilot takes routing information from the Topic and sends it to the data plane /Envoy in the form of xDS. The Envoy proxys all network requests that the SDK sends to the Broker/Nameserver;
  • When sending, the Envoy determines that the request is sent through the Request code, selects the corresponding CDS according to the Topic and request code, and selects the corresponding Broker through the load-balancing strategy provided by the Envoy and sends it. Subset of the data plane is used to ensure that the selected Broker is writable.
  • When consuming, the Envoy determines that the request is a consumption through the Request code, selects the corresponding CDS based on the Topic and request code, and selects the corresponding Broker to consume as Envoy (similar to sending, Subset is also used here to ensure that the selected Broker is readable) and the corresponding metadata is recorded. When the MESSAGE consumption SDK sends an ACK request, the corresponding metadata information is extracted for comparison, and then the ACK request is accurately sent to the Broker used in the last consumption through routing.

Problems encountered with RocketMQ Mesh

Service Mesh is often referred to as the next generation of micro-services, which on the one hand reveals that micro-services are the absolute dominant force in the early wave of Mesh. On the other hand, the Mesh of micro-services is relatively more convenient. As message queue and some database products gradually move towards Service Mesh, Each product has its own issues to address along the way, and RocketMQ is no exception.

Stateful network model

RocketMQ’s network model is more complex than RPC and is a set of stateful network interactions, which are mainly reflected in two aspects:

  • RocketMQ’s current network calls are highly dependent on stateful IP;
  • The load balancing at consumption time in the native SDK makes it impossible to ignore the state of each consumer.

For the former, it makes it completely impossible for existing SDKS to use partitioned order messages because IP/(BrokerName + BrokerId) is not contained in the contents of the sending request and consuming request RPCS. As a result, the SDK after using the Mesh cannot ensure that the queues sent and consumed are on the same Broker, that is, the Broker information itself is erased during the Mesh process. Of course, this is not the case for a global sequential message with only one Broker, because the data plane has no other brokers to choose from when the load is balanced, so at the routing level, the global sequential message is not a problem.

In RocketMQ’s Pull/Push Consumer, the Queue is the basic unit of load balancing, In fact, native consumers need to perceive the number of consumers in the same ConsumerGroup and consume in the same Topic. Each Consumer selects the corresponding Queue according to its position to consume. These queues are exclusive to each Consumer in a topi-Consumergroup mapping, which is difficult to achieve on the existing data plane. Moreover, the load balancing on the existing data plane cannot achieve the granularity of Queue. As a result, RocketMQ’s load balancing strategy is no longer applicable to Service Mesh.

At this point we looked at RocketMQ’s HTTP enabled Pop Consumer interface, where each Queue can no longer be exclusive to the current topic-consumerGroup Consumer, Different consumers can consume data in a Queue at the same time, which makes it possible to use the native load-balancing strategy in Envoy.

The right side of Figure 2 shows the Pop Consumer consumption in the Service Mesh. In the Envoy we ignore the Queue sent by the SDK.

Bomb massive Topic routing information

Within the group, Nameserver stores the routing information of topics in GB, which is abstracted into CDS in Mesh. As a result, the control plane can only push CDS in full quantity if the Topic used by the application cannot be known in advance. This will undoubtedly bring tremendous stability pressure to the control plane.

In the earlier years, there are full scale pushes, the control plane sends full xDS messages when the data plane is first activated, and then the control plane can actively control the frequency of the data being sent, but certainly the data is still full scale. Subsequent envoys support part of the Delta xDS API, that is, sending incremental xDS data to the data plane. This of course reduces the volume of incoming data to existing Sidecars, but the xDS data in the Sidecar is still full. Corresponding to RocketMQ, the full CDS information is in memory, which is unacceptable to us. So we want to have on-demand CDS so that Sidecar can just get the CDS he wants. At this point, the Envoy supports Delta CDS, and only this delta xDS. In fact, the xDS protocol that owns Delta CDS already provides on-demand CDS capability, but neither the control plane nor the data plane exposes this capability. So here we modify the Envoy and expose the interface so that the data plane can actively make requests to the control plane for specified CDS and implement a simple control plane in the manner of Delta gRPC. Envoy initiates an unsolicited request for a specified CDS resource and provides a corresponding callback interface to invoke when the resource returns.

The description of On-Demand CDS corresponding to the RocketMQ process is that when the GetTopicRoute or SendMessage request arrives Envoy, Envoy hangs on this process and initiates a request to the corresponding CDS resource in the control plane until the resource returns and restarts the process.

I also launched a Pull Request to the community for the modification of On-Demand CDS, but it seems that the idea at that time was too immature. The reason was that we completely ignored RDS and had a strong binding between CDS and Topic, even with the same name. Senior Maintainer @htuch refuted our idea. In fact, CDS resource names may be loaded with load balancing methods, inbound/outbound and other prefix and suffix, which cannot be directly equivalent to Topic names. More importantly, the definition of CDS given by the community is separate from business. But what we did was tricky and went against the purpose of the community.

Therefore, we need to add RDS for abstraction. RDS locates the specific required CDS name through topic and other information. As a data plane, it is impossible to know the required CDS name at the code level in advance. On-demand CDS by CDS name is out of the question, so from this point on you have to accept the full option, but fortunately it doesn’t affect the code’s contribution to the community.

route_config:
  name: default_route
  routes:
  - match:
      topic:
        exact: mesh
      headers:
        - name: code
          exact_match: 105
    route:
      cluster: foo-v145-acme-tau-beta-lambda
Copy the code

As can be seen from the above, the request for topic named mesh will be routed to the CDS foo-v145-Acme-tau-beta-lambda by RDS. In advance, we only know the topic name, but cannot know the CDS resource name that is matched.

Now from a higher perspective, it is easy to find this mistake, but in fact, we did not correct this problem until the subsequent code review, so we could have done better earlier.

However, based on the current community dynamics, on-demand xDS may already be a roadmap. At least xDS now supports delta and VHDS supports on-demand features for the first time.

What does Mesh bring to RocketMQ?

A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It’s responsible for the reliable delivery of requests through the complex topology of services that comprise a modern, cloud native application. In practice, the service mesh is typically implemented as an array of lightweight network proxies that are deployed alongside application code, without the application needing to be aware.

That’s how William Morgan, who coined the term Service Mesh, defined it: as a network agent, transparent to the user, and responsible as an infrastructure.

Responsibilities include service discovery, load balancing, traffic monitoring, and so on in RocketMQ, reducing the responsibilities of both the caller and the proxied.

Of course, the current RocketMQ Filter makes a lot of concessions to ensure compatibility. For example, in order to ensure that the SDK can successfully obtain the route, the route information is aggregated as TopicRouteData and returned to the SDK. But ideally, The SDK itself doesn’t need to care about routing. The SDK designed for Mesh scenarios is much more streamlined. Rebalance, send and consume no more service discovery. Even in the future, functions such as message body compression and schema validation may be eliminated by SDKS and brokers. Send/consume as you come, and send/consume as you go may be the ultimate form of RocketMQ Mesh.

What’s Next ?

The RocketMQ Filter currently has the ability to send normal messages and Pop consumption, but there are a few features that need to be added if you want a more complete product form:

  • Support for Pull requests: Currently Envoy Proxy only accepts Pop consumption requests, but will consider supporting ordinary Pull requests, which will be rendered unattended by users.
  • Support for global sequential messages: Currently, routing of global ordered messages is not a problem in the Mesh framework, but if multiple consumers consume global ordered messages at the same time, one of them suddenly goes offline and the message does not ACK, the other Consumer’s message will be out of order. This needs to be assured in the Envoy.
  • Proxy on the Broker side: Requests on the Broker side are also brokered and scheduled.

The meandering journey of community

At the beginning, RocketMQ Filter’s initial Pull Request included almost all of the current features, resulting in a very large PR of over 8K lines. Thanks to @Tianqian for his professional work in Code Review, which helped us to integrate into the community faster.

In addition, the Envoy community’s CI is extremely strict, requiring 97% or more single-test line coverage, Bazel source-level dependency, pure static linking, no cache compilation of its own, 24 logical core CPUS and loads loaded in at least half an hour. Community CI runs range from two or three to six or seven hours at a time, and have extremely strict syntax and format requirements for newly submitted code, which makes it possible to change a small part of the code in PR with a large number of single test changes and format requirements. But the good news is that the single test can easily help us find some memory cases. Objectively, the official community requires such high standards to control the coding quality to a large extent. During the process of completing single testing, we found and solved many problems of their own. Generally speaking, these problems are necessary. Debugging and tracing are much more difficult.

Finally, the RocketMQ Filter code was co-written by @Shuda and me. As a person with little open source experience, contributing code to such a popular community was a very valuable experience. I would also like to thank Shuda for his help and advice in this process.

A link to the

  • Official docs for RocketMQ filter
  • Pull request of RocketMQ filter
  • RocketMQ filter’s issue
  • On-demand CDS pull request for Envoy
  • First version of RocketMQ filter’s proposal

Alibaba Cloud native middleware team is recruiting

There are enough business scenarios, big enough messaging ecosystems, and deep enough distributed technologies to explore if you:

  • Proficient in at least one programming language, Java or C++;
  • In-depth understanding of distributed storage theory, micro-service optimization practice;
  • Solid theoretical basis of computer, such as operating system principle, TCP/IP, have a more in-depth understanding;
  • Ability to independently design a production environment with high availability and reliability middleware, such as RocketMQ;
  • Familiar with high concurrency, distributed communication, storage, open source middleware and other related technologies is preferred.

Welcome to join and participate in the construction of a new generation of cloud native middleware! Resume submission address: [email protected].

Course recommended

In order for more developers to enjoy the dividends brought by Serverless, this time, we gathered 10+ Technical experts in the field of Serverless from Alibaba to create the most suitable Serverless open course for developers to learn and use immediately. Easily embrace the new paradigm of cloud computing – Serverless.

Click to free courses: developer.aliyun.com/learning/ro…

“Alibaba Cloud originator focuses on micro-service, Serverless, container, Service Mesh and other technical fields, focuses on the trend of cloud native popular technology, large-scale implementation of cloud native practice, and becomes the public account that most understands cloud native developers.”