Introduction | Recently, the Cloud + Community technology salon "Tencent Open Source Technology" came to a successful close. The salon invited a number of Tencent technical experts to discuss Tencent's open source work with developers and take an in-depth look at the Tencent open source projects TencentOS Tiny, TubeMQ, Kona JDK, TARS, and MedicalNet. This article is an edited transcript of Zhang Guocheng's speech.

Main points of this talk:
  • The principles and features of Message Queue;
  • The implementation principles and usage of TubeMQ;
  • Discussion of the follow-up development of TubeMQ.

I. Message Queue

Message Queue (MQ) is defined in Wikipedia as a mode of communication between different processes, or between different threads within the same process.

So why do we adopt MQ? This is determined by the nature of MQ. First, it can integrate many different systems so that they work together. Second, it decouples data transfer from data processing. Third, it buffers traffic peaks. The systems we commonly come into contact with, such as Kafka, RocketMQ, and Pulsar, basically all share these characteristics.
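To make the decoupling and peak-buffering points concrete, here is a minimal in-process sketch (plain Java, not any MQ's API; the class and method names are invented for illustration): a bounded queue absorbs a producer burst while the consumer drains at its own pace, which is the same role an MQ plays between separate systems.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PeakBuffer {
    // Push burstSize messages through a queue of the given capacity and
    // return how many the consumer received.
    public static int drain(int burstSize, int capacity) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            for (int i = 0; i < burstSize; i++) {
                try {
                    queue.put(i); // blocks when the buffer is full: backpressure
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        producer.start();
        int consumed = 0;
        while (consumed < burstSize) {
            queue.take(); // the consumer proceeds at its own pace
            consumed++;
        }
        producer.join();
        return consumed;
    }

    public static void main(String[] args) throws InterruptedException {
        // A burst of 10,000 messages flows through a buffer of only 128 slots.
        System.out.println(drain(10_000, 128));
    }
}
```

Neither side ever sees the other directly; each only talks to the queue, which is the essence of the decoupling described above.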

What does an MQ for big data scenarios look like? In my personal understanding: high throughput and low latency, a system that is as stable as possible, cost as low as possible, a protocol that does not need to be particularly complex, and, above all, horizontal scalability that is as high as possible. For example, our own production traffic may double within a month or a year; without the capacity to scale horizontally, the system is prone to all kinds of problems.

II. TubeMQ Implementation Principles and Usage

1. TubeMQ characteristics

So what are the characteristics of Tencent's self-developed TubeMQ? TubeMQ is a distributed messaging middleware built for trillion-message-per-day scale, focusing on data transfer and storage for massive data volumes, with distinct advantages in performance, reliability, and cost.

For big data application scenarios, we provide a test scheme (the details can be found in the Docs directory of the Tencent TubeMQ open source project). First we define a practical application scenario, and then collect system data under that definition. The result: a throughput of 140,000 TPS with message latency under 5 ms.

Some may be curious: many studies have analyzed similar distributed publish-subscribe messaging systems, such as Kafka, reaching millions of TPS. Do our numbers look bad by comparison?

In fact, there is a premise here: we achieved this performance in a scenario of 1,000 topics with 10 partitions configured per Topic. Kafka may reach millions of TPS and low latency in other setups, but in our big data scenario it comes nowhere near that.

Our system has been running stably online for 7 years. It adopts a thin-client model with management and control partly on the server side. These technical characteristics determine its application scenarios: real-time advertising recommendation, massive data reporting, metrics and monitoring, stream processing, and data reporting in IoT scenarios.
In these scenarios partial data loss can be tolerated, because in the worst case it can be remedied by repeated reporting. Given the sheer magnitude of the data, a small amount of reliability is sacrificed in exchange for high-performance data access.

2. System architecture of TubeMQ

What is the architecture of the TubeMQ system? As shown in the figure below, it interacts with the outside world through an SDK, although it is also possible to connect directly through the TCP protocol we defined.
It is worth noting that TubeMQ is written in pure Java, uses Master HA coordination nodes, and only relies weakly on ZooKeeper, just to manage offsets. The message storage scheme has also been improved and the data reliability scheme adjusted: instead of multiple replicas across multiple Broker nodes, it uses disk RAID10 plus a fast-consumption scheme on a single copy, and its metadata is self-managed.

3. Research and development of TubeMQ

As shown in the figure below, we have gone through four stages since we started keeping records: introduction, improvement, self-development, and now self-innovation. Daily message volume grew from 20 billion in June 2013 to 35 trillion in November 2019, and is expected to reach 40 trillion in 2020.

Generally speaking, this is also the key reason why we chose to develop our own system from the beginning. When the data volume is small, say a billion messages a day or fewer, any MQ will do. But once the volume reaches tens of billions, then hundreds of billions, then a trillion or more, more and more constraints appear, including the stability and performance of the system as well as the cost of machines, operation, and maintenance.

Our production TubeMQ deployment has 1,500 machines, and operating and maintaining them takes only about one person's effort (two people working part time). Our Broker storage servers can run for up to four months without a restart, a result of the improvements and innovations we have made on top of the original base.

4. Horizontal comparison of TubeMQ with other MQ

The table below is a horizontal comparison of TubeMQ with other MQs. Our project is open source, so you can verify the data directly if you are interested. In general, TubeMQ stands the test of practice in scenarios that require high performance and low cost and can tolerate data loss in extreme cases.

III. Storage Scheme and Control Measures of TubeMQ

The core of an MQ is its storage scheme, as shown in the figure below. The list of storage schemes on the right was compiled by a Zhihu user named Chen Dabai; the storage scheme of TubeMQ is on the left.

TubeMQ organizes its storage instances as memory plus files, per Topic. Data is first written to the primary memory block; when the primary block is full, it is switched into the secondary role, and data is asynchronously flushed from the secondary block to file to complete the write. The consumption offset determines whether a read is served from the primary memory block, the secondary memory block, or the files, thus reducing the burden on the system and increasing its storage capacity.
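The primary/secondary block scheme just described can be sketched in a few lines. This is an illustrative toy model, not TubeMQ source code: the class name is invented, and the asynchronous flush is approximated lazily (the previous secondary block is written to file only when the next swap happens).

```java
import java.util.ArrayList;
import java.util.List;

// Toy per-Topic store: primary in-memory block for writes, a secondary
// block awaiting flush, and a "file" of already-flushed messages.
public class TopicStore {
    private final int blockSize;
    private final List<String> file = new ArrayList<>();     // flushed to disk
    private List<String> flushing = new ArrayList<>();       // secondary block
    private List<String> primary = new ArrayList<>();        // active block

    public TopicStore(int blockSize) { this.blockSize = blockSize; }

    public void append(String msg) {
        if (primary.size() == blockSize) {
            file.addAll(flushing);          // previous secondary finishes flushing
            flushing = primary;             // full primary becomes the secondary
            primary = new ArrayList<>();    // fresh primary keeps accepting writes
        }
        primary.add(msg);
    }

    // Route a read by offset: file first, then secondary, then primary.
    public String read(long offset) {
        int idx = (int) offset;
        if (idx < file.size()) return file.get(idx);
        idx -= file.size();
        if (idx < flushing.size()) return flushing.get(idx);
        idx -= flushing.size();
        return primary.get(idx);
    }

    public static void main(String[] args) {
        TopicStore store = new TopicStore(2);
        for (int i = 0; i < 5; i++) store.append("m" + i);
        // Offsets 0, 2, 4 land in the file, secondary, and primary tiers.
        System.out.println(store.read(0) + " " + store.read(2) + " " + store.read(4));
    }
}
```

The point of the swap is that writes never wait for a flush: they always go to an empty or partially filled primary block, while the reader transparently follows its offset across the three tiers.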

As you can see from the storage chart on the right, Kafka, RocketMQ, and JMQ (launched by JD in 2019) are not that far apart. However, it is important to note that each storage scheme reaches different performance levels and has different resource requirements.

What we are providing is a compromised service. What does that mean? We lose data in the event of a power failure before data is stored or flushed, or in the event of a disk failure that RAID10 cannot recover from. Apart from those two cases, we do not lose anything.

Because these two types of failures can occur at any time, TubeMQ is not really suitable for persistent, repeated-consumption scenarios that require complete consistency. So why do we do this? Couldn't we keep multiple replicas? It is not that we cannot.

The problem is cost. Doing a horizontal comparison: do you know how many machines we save this way, and how much money that amounts to?

Here is a reference number: on November 8, 2019, LinkedIn, which created and open-sourced Kafka, published an article stating that it handles about 7 trillion messages per day on 4,000 machines; the article is available online. Another example is a domestic company working on big data and related application scenarios that uses native Kafka for big data access: by the end of 2018 it had also reached a volume of 7 to 8 trillion messages, using about 1,500 10-gigabit machines.

Speaking of which, how many machines do we need under this model? We are now handling 35 trillion messages a day with 1,500 machines; under the same premise, we use about 1/4 to 1/5 as many machines as external MQs. How much is that in RMB? A commodity server costs about 100,000 yuan, so we save several hundred million yuan on machine cost alone, which is why we adopted this scheme.
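As a back-of-the-envelope check of that claim, using only the figures quoted in the talk (1,500 machines, a 4-5x machine-count multiplier for external MQs, roughly 100,000 yuan per machine; everything else is plain arithmetic):

```java
public class CostEstimate {
    static final long TUBEMQ_MACHINES = 1_500;
    static final long CNY_PER_MACHINE = 100_000; // rough commodity-server price

    // Machines saved versus an external MQ needing `multiplier` times the
    // fleet, converted to yuan.
    public static long savingsCny(double multiplier) {
        long externalMachines = Math.round(TUBEMQ_MACHINES * multiplier);
        return (externalMachines - TUBEMQ_MACHINES) * CNY_PER_MACHINE;
    }

    public static void main(String[] args) {
        System.out.println(savingsCny(4.0)); // 450000000 (450 million yuan)
        System.out.println(savingsCny(5.0)); // 600000000 (600 million yuan)
    }
}
```

So the 4-5x multiplier alone puts the machine-cost saving in the 450-600 million yuan range, consistent with "several hundred million yuan".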

Compared to Kafka's asynchronous inter-node replication, we only need about a quarter of the machines. Of course, even with a single copy, our performance is much better than Kafka's, which is where the machine savings come from. See our test reports for more information.

All the management and control logic of TubeMQ, including all its API interfaces, is built around its storage: Topic configuration, flow control, data queries, API stock-taking, and so on. The figure below shows the core API definition of TubeMQ, which is divided into four parts. If you just want to use the system, you can operate it directly from the console; but if you want to fine-tune it, you need to understand the API definitions.

TubeMQ's management and control module, the Master, manages the clustered Broker nodes based on the embedded BDB database. For the Topic information configured on each Broker, every entry operated on via the red action bar carries a state that tells the operator what stage it is currently in: whether an action is in progress, and whether it is read-only, write-only, or readable and writable. You can also search on this page. The system can even run on Windows; everyone is welcome to try it.

The authentication and authorization design of TubeMQ also differs from traditional designs, because we redefined TubeMQ's authentication mechanism, as shown in the figure below.

IV. Why Open Source?

First, it follows from the company's open source policy: internal open-source collaboration and external technical influence, so we chose open source. Second, based on the information we have, we believe that open-sourcing TubeMQ can be of real value to people who need something in this area. Third, we feel that open source breaks down barriers.

People in different corners of the world are studying the same problems, like parallel universes: everyone researches and analyzes in their own universe, with little communication between them. We believe someone out there must be doing better than us, with things worth learning from. So we opened the project up, creating a state where everyone knows about it and can learn from one another, which is also a way for everyone to improve. Based on these three points, we finally chose open source.

Why contribute to Apache when it's already open source? In fact, I understand that many developers are afraid to use open source projects, because many companies open-source a project and then, once you adopt it, you find no one maintains it anymore.

To solve this problem, we hoped to donate it to a neutral foundation and, through the foundation's established standardized processes, including its documentation requirements and multiple acceptance conditions, turn it into a mature project that everyone can accept. Even if the original team drops the project, someone else can pick it up and keep it moving forward.

So we donated it to a foundation. Why Apache? Since ours is an MQ focused on big data scenarios, and Apache is the community best known for the big data ecosystem, and we benefit from that ecosystem as well, it was natural to give back to the community and donate to Apache. TubeMQ has now been an Apache incubator project for some time.

V. Discussion of the Follow-up Development of TubeMQ

In the first half of 2020, with open source advancing in step, more and more internal business data will be connected, and the average daily volume will soon exceed 40 trillion messages.

Our machines will also be upgraded from TS60 to BX2. What changes will that bring? On the previous machines, CPU ran at 99% and disk I/O at 30-40%; according to the latest test data, on BX2 this flips to 30-40% CPU and 99% disk I/O. Therefore, we need to reduce disk I/O as much as possible, or choose another more suitable machine type; this still needs to be studied.

In addition, since we are open source, how to cultivate the community is also a key point. At present, we run the project under an open collaboration mechanism: whether contributors are inside or outside the company, their contributions make the project bigger. We will keep consolidating the project in its own area so that everyone can use it according to their own needs.

At the same time, if you find imperfections while using it, we hope you will contribute fixes through the community, so that together we can make this project great.

We are not only doing MQ; we are also building an aggregation layer and an acquisition layer, with a management layer on top of that. Our hope is to open source the whole stack once the MQ part is stable. The system will be able to accommodate different MQs, selecting one for each external business based on the MQ's characteristics while keeping the choice transparent to that business.

VI. Q&A

Q: Mr. Zhang, you just compared TubeMQ to Kafka and introduced TubeMQ's internal storage structure, but I found that the internal storage structures of TubeMQ and Kafka are not that different; you only add a backup cache. Why does just one backup block let you outperform Kafka by so much?

A: Kafka is Partition-based: each Partition has a file block. TubeMQ is Topic-based. The second difference is that our memory works in primary/secondary mode. As you asked just now, why is an extra memory block faster? It is generally accepted that writing to memory is fast. When a memory block fills up, the full block is switched into the secondary role to asynchronously flush to file, while the other block takes over writing. In this way there is far less read/write conflict across the primary/secondary switchover, and overall performance is faster.

Why did we switch to this storage structure? For Kafka, at 1000 topics x 10 partitions the workload has degenerated into random reads and writes, and the performance metrics are neither good nor stable. RocketMQ stores all data in a single file and constructs an additional file per Partition. This creates a problem: the data file can become a write bottleneck, and overall system performance degrades as traffic increases.

JMQ defines data files by Topic, but each Partition still gets its own new files. It is a bit broader than RocketMQ in that it does not centralize all data into a single file; by splitting per Topic, it alleviates some of RocketMQ's problems.
What about TubeMQ? TubeMQ keeps one data file per Topic: different Topics have different files, and we have no Partition-level files. We define storage units entirely by Topic, as one data file plus one index file. You can analyze these schemes yourself; each has its own characteristics, and their performance differs across scenarios, including your hardware environment, where the differences can in fact be large.

Speaker introduction

Zhang Guocheng, senior engineer at Tencent, has been responsible for the research and development of the TubeMQ project since 2015. He has seen the system through optimization and transformation as data grew from the trillions to 35 trillion messages a day, and has rich practical project experience in the field of massive data access.