
1. The purpose of introducing DLedger



Prior to version 4.5, RocketMQ could only be deployed in Master/Slave mode. A group of brokers had a single Master and zero or more Slaves, and the Slaves synchronized the Master's data through synchronous or asynchronous replication. The Master/Slave deployment mode provides a degree of high availability, but it has a drawback: if the Master fails, the cluster has to be restarted or switched over manually, because a Slave cannot be promoted to Master automatically. We therefore wanted a new multi-replica architecture to solve this problem.

The new multi-replica architecture first needs to solve the problem of automatic failover, which is essentially automatic leader election. There are basically two solutions:

  • Use a cluster of a third-party coordination service, such as ZooKeeper or etcd, for leader election. This solution introduces heavyweight external components and increases the cost of deployment, operations, and troubleshooting: a ZooKeeper cluster has to be maintained alongside the RocketMQ cluster, and a ZooKeeper failure then affects the RocketMQ cluster.
  • Implement automatic leader election with the Raft protocol. The advantage of Raft is that no external components are needed: the election logic is integrated into each node's process, and the nodes complete the election by communicating with each other.

This is why the Raft protocol was chosen to solve the problem. DLedger is a commitlog repository based on the Raft protocol, and it is the key to RocketMQ's new high-availability multi-replica architecture.

2. DLedger design concept

1. DLedger positioning



The Raft protocol is usually described as a replicated state machine, but that model does not map cleanly onto a messaging system. A messaging system is itself an intermediate proxy: the commitlog is the system's final state, and there is no need for a state machine to rebuild another state on top of it. DLedger therefore drops the state-machine part of Raft, while still relying on Raft to keep the commitlog consistent and highly available.



At the same time, DLedger is a lightweight Java library. It exposes a very simple API with just two operations, append and get. Append writes data to DLedger and returns a monotonically increasing index for the entry; get retrieves the data stored at a given index. DLedger is therefore an append-only log system.
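
To make this contract concrete, here is a minimal, self-contained Java sketch; the AppendOnlyLog interface and InMemoryLog class are purely illustrative stand-ins for the described semantics, not DLedger's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the append/get contract described above.
// Hypothetical interface for illustration; it mirrors, but is not, DLedger's API.
interface AppendOnlyLog {
    long append(byte[] body);   // stores data, returns a monotonically increasing index
    byte[] get(long index);     // returns the data previously stored at that index
}

// In-memory reference implementation, just to make the semantics concrete.
class InMemoryLog implements AppendOnlyLog {
    private final List<byte[]> entries = new ArrayList<>();

    @Override
    public synchronized long append(byte[] body) {
        entries.add(body);
        return entries.size() - 1;   // the index grows by one for every append
    }

    @Override
    public synchronized byte[] get(long index) {
        return entries.get((int) index);
    }
}

public class AppendGetDemo {
    public static void main(String[] args) {
        AppendOnlyLog log = new InMemoryLog();
        long idx = log.append("hello dledger".getBytes());
        System.out.println(new String(log.get(idx)));   // prints "hello dledger"
    }
}
```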

2. DLedger application scenarios



The first application of DLedger is in a distributed messaging system. Since RocketMQ 4.5, RocketMQ can be deployed on DLedger: the DLedger commitlog replaces the original commitlog, giving the commitlog the ability to elect a leader and replicate entries. The Raft role is then passed through to the external broker role, so the leader corresponds to the original Master, while followers and candidates correspond to the original Slaves.

RocketMQ brokers therefore gain automatic failover. In a group of brokers, when the Master goes down, DLedger automatically elects a new leader, which then becomes the new Master through role passthrough.



DLedger can also be used to build highly available embedded KV storage. Operations on the data are recorded in DLedger and then replayed into a HashMap or RocksDB, depending on the data volume and actual requirements. This yields a consistent, highly available KV store that can be applied to scenarios such as metadata management.
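
As a rough illustration of this idea, the sketch below replays logged operations into a HashMap; the "PUT key value" / "DEL key" entry encoding and the class name are assumptions made for the example, not DLedger's actual format.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a KV store rebuilt by replaying log entries in index order.
public class ReplayedKvStore {
    private final Map<String, String> state = new HashMap<>();

    /** Apply one logged operation to the in-memory state. */
    public void apply(byte[] entryBody) {
        String[] op = new String(entryBody).split(" ", 3);
        switch (op[0]) {
            case "PUT": state.put(op[1], op[2]); break;
            case "DEL": state.remove(op[1]);     break;
            default: throw new IllegalArgumentException("unknown op: " + op[0]);
        }
    }

    /** Rebuild the map by replaying every entry, from the first index to the last. */
    public static ReplayedKvStore recover(List<byte[]> entriesInIndexOrder) {
        ReplayedKvStore store = new ReplayedKvStore();
        entriesInIndexOrder.forEach(store::apply);
        return store;
    }

    public String get(String key) { return state.get(key); }
}
```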

3. Optimization of DLedger

1. Performance optimization



The Raft replication process has four steps: the client sends a message to the leader; the leader stores it locally and replicates it to the followers; the leader waits for the followers' acknowledgements; once a majority has acknowledged the entry, the result is returned to the client. How does DLedger optimize this replication process?

(1) Asynchronous thread model

DLedger uses an asynchronous thread model, which reduces waiting. When a system has fewer blocking points and each thread spends less time waiting while processing requests, the CPU is utilized better, which improves throughput and performance.



The figure above (the DLedger asynchronous thread model) describes the whole process of DLedger handling an Append request. The thick arrows represent RPC requests, the solid arrows represent data flow, and the dotted lines represent control flow.

First, the client sends an Append request, which is received by DLedger's communication module. DLedger's default communication module is implemented with Netty, so a Netty IO thread hands the request to a thread in the business thread pool and immediately returns to process the next request. The business thread handles the Append request in three steps. First, it writes the Append data to its own log, i.e. the page cache. Second, it creates an Append CompletableFuture and puts it into a pending map; the future is still undecided at this point, because the entry has not yet been confirmed by a majority. Third, it wakes up the EntryDispatcher thread and tells it to replicate the log to the followers. After these three steps the business thread can move on to the next Append request, with almost no waiting in between.
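
A minimal sketch of these three steps might look like the following; the class and method names (handleAppend, wakeUpDispatchers, the pending map) are hypothetical and only mirror the description above, not DLedger's real implementation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of the three append steps described above.
public class AppendPathSketch {
    // Pending map: log index -> future that completes once a quorum has acknowledged.
    private final ConcurrentMap<Long, CompletableFuture<Long>> pending = new ConcurrentHashMap<>();
    private long nextIndex = 0;

    public CompletableFuture<Long> handleAppend(byte[] body) {
        // Step 1: write the entry to the local log (page cache in the real system).
        long index = writeToLocalLog(body);

        // Step 2: register a future for this index; it stays undecided until
        // the quorum checker confirms replication to a majority of nodes.
        CompletableFuture<Long> future = new CompletableFuture<>();
        pending.put(index, future);

        // Step 3: wake the replication (dispatcher) threads and return immediately,
        // so the business thread can pick up the next request without waiting.
        wakeUpDispatchers();
        return future;
    }

    private synchronized long writeToLocalLog(byte[] body) {
        return nextIndex++;               // placeholder for the real page-cache write
    }

    private void wakeUpDispatchers() {
        // placeholder: notify the per-follower EntryDispatcher threads
    }
}
```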

Replication itself is handled by the EntryDispatcher threads, which copy logs to the followers. There is one EntryDispatcher thread per follower, and each records the replication position (water mark) of its own follower. Whenever a position advances, the QuorumAckChecker thread is notified; it checks whether the log entry has been replicated to a majority of nodes, and if so completes the corresponding Append CompletableFuture and notifies the communication module to return the response to the client.
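
The quorum check itself can be sketched as follows, assuming each node's acknowledged position is kept in a map of water marks; again, the names and structure are illustrative, not DLedger's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of the quorum-check step: given each node's confirmed
// replication position ("water mark"), find the highest index stored on a
// majority of nodes and complete the pending Append futures up to that index.
public class QuorumAckCheckerSketch {

    /** Largest index that a majority of nodes (leader included) has acknowledged. */
    static long quorumIndex(Map<String, Long> waterMarkByNode) {
        List<Long> marks = new ArrayList<>(waterMarkByNode.values());
        Collections.sort(marks, Collections.reverseOrder());
        int majorityPosition = waterMarkByNode.size() / 2;   // e.g. 3 nodes -> position 1
        return marks.get(majorityPosition);
    }

    /** Complete every pending future whose index is now covered by the quorum. */
    static void completeAcked(ConcurrentMap<Long, CompletableFuture<Long>> pending,
                              long quorumIndex) {
        pending.forEach((index, future) -> {
            if (index <= quorumIndex) {
                future.complete(index);   // the response can now be returned to the client
                pending.remove(index);
            }
        });
    }
}
```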

(2) Independent and concurrent replication process



In DLedger, the leader sends logs to each follower independently and concurrently. The leader assigns one thread to each follower to copy logs and record that follower's replication position. A separate asynchronous thread then checks, based on these positions, whether a log entry has been replicated to a majority of nodes, and returns the response to the client.
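
A simplified sketch of this per-follower replication loop is shown below. It advances the recorded position as soon as a push is issued, whereas the real system would advance it only on the follower's acknowledgement; all names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one dispatcher thread per follower, each pushing entries
// from its own position, so a slow follower never blocks the others.
public class PerFollowerDispatchSketch {
    // Shared water marks consumed by a quorum checker like the one sketched above.
    final Map<String, Long> waterMarkByNode = new ConcurrentHashMap<>();

    class FollowerDispatcher implements Runnable {
        final String followerId;
        long nextIndexToPush = 0;            // this follower's own replication position

        FollowerDispatcher(String followerId) { this.followerId = followerId; }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                if (nextIndexToPush <= lastLocalIndex()) {
                    pushEntryTo(followerId, nextIndexToPush);       // async push RPC
                    // Simplified: record the position right away; the real system
                    // only advances the water mark on the follower's acknowledgement.
                    waterMarkByNode.put(followerId, nextIndexToPush);
                    nextIndexToPush++;
                } else {
                    waitForNewEntries();
                }
            }
        }
    }

    void start(String... followerIds) {
        for (String id : followerIds) {
            new Thread(new FollowerDispatcher(id), "dispatcher-" + id).start();
        }
    }

    // Placeholders for the pieces the sketch does not model.
    long lastLocalIndex() { return -1; }
    void pushEntryTo(String follower, long index) { }
    void waitForNewEntries() {
        try { Thread.sleep(1); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```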

(3) Parallel log replication



With traditional linear replication, the leader copies an entry to the followers and only sends the next entry after the followers have confirmed the previous one. In other words, the leader waits for confirmation of the previous entry before replicating the next. This guarantees sequential, error-free replication, but throughput is low and latency is high. DLedger therefore designs and implements a parallel log replication scheme: the leader no longer waits for the previous entry to finish replicating before sending the next one, while the follower thread still processes the replication requests sequentially in index order. Any gaps that parallel replication can leave behind are handled by retransmitting a small amount of data.
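
One way a follower can accept out-of-order pushes while still writing strictly in index order is sketched below; this is an illustrative reconstruction of the idea, not DLedger's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of how a follower can accept parallel (out-of-order) push
// requests yet still write the log strictly in index order.
public class InOrderApplySketch {
    private final Map<Long, byte[]> outOfOrderBuffer = new ConcurrentHashMap<>();
    private long nextExpectedIndex = 0;

    /** Called for every push request; entries may arrive in any order. */
    public synchronized void onPush(long index, byte[] body) {
        if (index < nextExpectedIndex) {
            return;                          // duplicate or already applied, ignore
        }
        outOfOrderBuffer.put(index, body);   // park the entry until its turn comes
        // Drain every consecutive entry starting from the next expected index.
        byte[] ready;
        while ((ready = outOfOrderBuffer.remove(nextExpectedIndex)) != null) {
            appendToLocalLog(nextExpectedIndex, ready);
            nextExpectedIndex++;
        }
        // If a gap remains, the buffered requests simply wait; should the missing
        // entry never arrive, the leader retransmits it after a timeout.
    }

    private void appendToLocalLog(long index, byte[] body) {
        // placeholder for the real sequential disk/page-cache write
    }
}
```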

2. Reliability optimization

(1) DLedger optimization for network partitions



If the network partition shown in the figure above occurs and n2 is isolated from the other nodes in the cluster, then under the implementation in the Raft paper n2 keeps requesting votes without ever obtaining a majority, and its term keeps increasing. Once the network recovers, n2 interrupts n1 and n3, which have been replicating normally, and forces a re-election. To address this situation, DLedger's implementation refines the Raft protocol by breaking the vote request process into phases, two of which matter here: WAIT_TO_REVOTE and WAIT_TO_VOTE_NEXT. WAIT_TO_REVOTE is the initial state, in which a vote is requested without incrementing the term, whereas WAIT_TO_VOTE_NEXT increments the term before the next vote request begins. For n2 in the figure, as long as the number of valid votes does not reach a majority, the node can be kept in the WAIT_TO_REVOTE state, so its term is not increased. In this way, DLedger's tolerance of network partitions is improved.
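
The decision between the two states can be sketched roughly as follows; the enum values follow the article, while the decision rules and field names are simplified assumptions for illustration.

```java
// Hypothetical sketch of the two-stage vote decision described above.
public class VoteDecisionSketch {

    enum VoteState { WAIT_TO_REVOTE, WAIT_TO_VOTE_NEXT, BECOME_LEADER }

    static class VoteRound {
        long term;
        int validVotesReceived;   // votes granted to this candidate in this round
        int clusterSize;
        boolean sawHigherTerm;    // some peer has already moved to a larger term
    }

    static VoteState decide(VoteRound round) {
        int majority = round.clusterSize / 2 + 1;
        if (round.validVotesReceived >= majority) {
            return VoteState.BECOME_LEADER;
        }
        if (round.sawHigherTerm) {
            // The node has fallen behind the cluster: catch up by increasing
            // the term before the next vote round.
            round.term++;
            return VoteState.WAIT_TO_VOTE_NEXT;
        }
        // Not enough valid votes (e.g. the node is partitioned away): retry with
        // the SAME term, so an isolated node cannot inflate its term endlessly.
        return VoteState.WAIT_TO_REVOTE;
    }
}
```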

(2) DLedger reliability test

DLedger also offers very high fault tolerance. It can tolerate a node failing to work properly for a variety of reasons, such as:

● Process crashes
● Machine crashes (power failure, operating system crash)
● Slow nodes (full GC, OOM, etc.)
● Network failures and various kinds of network partitions

To verify DLedger's tolerance of these faults, a variety of tests were carried out locally, and the distributed system verification and fault-injection framework Jepsen was used to detect problems in DLedger and verify the system's reliability.



The Jepsen framework is designed to verify that a system remains consistent under specific faults. A Jepsen test setup consists of six nodes: one control node and five DB nodes. The control node can log in to the DB nodes over SSH and, under its control, the distributed system under test is downloaded and deployed onto the DB nodes to form the test cluster. After the test starts, the control node creates a group of worker processes, each with its own client of the distributed system. A generator produces the operations performed by each client, and the client processes apply those operations to the system under test. The start, end, and result of every operation are recorded in a history. Meanwhile, a special client process, the nemesis, injects faults into the system. After the test completes, a checker analyzes whether the history is correct and consistent.

Since DLedger is a commitlog repository based on the Raft protocol, i.e. an append-only log system, it was tested with Jepsen's Set model. A Set-model test has two phases. In the first phase, different clients concurrently add different data to the cluster under test, with faults injected along the way. In the second phase, a final read is issued against the cluster to obtain the final result set. The check then verifies that every successfully added element is present in the final result set, and that the final result set contains only elements that clients attempted to add.
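
The two conditions of this check can be restated as a small Java sketch; the real verification is performed by Jepsen's own checker, so this only restates the logic for clarity.

```java
import java.util.Set;

// Sketch of the Set-model check described above.
public class SetModelCheckSketch {

    /**
     * @param attempted     every element any client tried to add
     * @param acknowledged  elements whose add was acknowledged as successful
     * @param finalRead     the result set obtained by the final read
     */
    static boolean consistent(Set<String> attempted,
                              Set<String> acknowledged,
                              Set<String> finalRead) {
        // 1. Nothing acknowledged may be lost.
        boolean nothingLost = finalRead.containsAll(acknowledged);
        // 2. Nothing may appear that was never attempted.
        boolean nothingUnexpected = attempted.containsAll(finalRead);
        return nothingLost && nothingUnexpected;
    }
}
```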

The figure above shows one DLedger test result. Thirty client processes concurrently add data to the DLedger cluster under test while random symmetric network partitions are continuously injected; the whole run lasts 600 seconds. In the end, about 160,000 items of data were sent and none were lost (lost-count = 0), and no data appeared that should not exist (unexpected-count = 0), so the consistency check passed.



The figure above shows every operation performed by the clients against the DLedger cluster during this test. A small blue box indicates that an add succeeded, a small red box indicates that an add failed, and a small yellow box indicates that the outcome is unknown (for example, a majority-acknowledgement timeout); the grey bands mark the periods during which faults were injected. Some fault periods make the cluster temporarily unavailable while others do not, which is reasonable: the partitions are random network isolations, so it depends on whether the isolated nodes trigger a cluster re-election. Even when a re-election is triggered, the DLedger cluster becomes available again after a short time.

In addition to symmetric network partitions, DLedger's behaviour under other faults was also tested, including randomly killing nodes, randomly suspending the processes of some nodes to simulate slow nodes, and complex asymmetric network partitions such as bridge and partition-majorities-ring. Under all of these faults, DLedger maintained consistency, verifying that it has good reliability.

4. Future development of DLedger

DLedger’s next plans include:

● Leader node preference
● Jepsen testing of RocketMQ on DLedger
● Run-time membership changes
● Adding observers (replication only, no voting)
● Building highly available K/V storage
● …

DLedger is now a project under OpenMessaging, and the community is welcome to join in building a high-availability, high-performance commitlog repository.

Author information: this article was compiled by Wu Wenliang from a talk by Jinrongtong.


Author of this article: Middleware brother
