Introduction | There is a lot of message middleware out there: RabbitMQ, Kafka, RocketMQ, Pulsar, Redis, and so on, a dazzling array of choices. What are their similarities and differences, and which one should you pick? This article takes a step-by-step look at the evolution of the technology, using Redis, Kafka, and Pulsar as examples, to explain their architectures and principles and help you better understand and learn message queues. Author: Liu Deen, Tencent IEG R&D engineer.

One, the most basic queue

The most basic message queue is really just a double-ended queue, which can be implemented with a doubly linked list, as shown in the following figure:

push_front: adds an element at the head of the queue; pop_tail: takes an element from the tail of the queue.

With such a data structure we can build an in-memory message queue: some tasks keep adding messages to the queue, while others keep taking messages off the other end in order. The tasks that add messages are called producers, and the tasks that pull and consume messages are called consumers.

Implementing such an in-memory message queue is not difficult, even easy. But keeping it efficient under massive concurrent reads and writes takes a lot of work.
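To make the idea concrete, here is a minimal, illustrative sketch of such an in-memory queue in Python, using a deque plus a lock. It is a toy for this article only, with none of the durability or concurrency engineering a real queue needs:

```python
# A minimal in-memory message queue sketch using Python's deque.
# Illustrative only: no persistence, no ack, and it lives in a single process.
import threading
import time
from collections import deque

queue = deque()            # the double-ended queue holding messages
lock = threading.Lock()    # protect concurrent access

def producer():
    for i in range(5):
        with lock:
            queue.appendleft(f"message-{i}")   # push_front

def consumer():
    consumed = 0
    while consumed < 5:
        with lock:
            msg = queue.pop() if queue else None   # pop_tail
        if msg is None:
            time.sleep(0.01)
            continue
        print("consumed:", msg)
        consumed += 1

t1, t2 = threading.Thread(target=producer), threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
```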

Two, the Redis queue

Redis provides exactly such a data structure: the list. The Redis list supports:

lpush: inserts data at the left end of the list; rpop: takes data from the right end of the list.

This maps directly onto our queue abstraction's push_front and pop_tail, so we can use a Redis list as a message queue directly. Redis itself is well optimized for high concurrency, with carefully designed and tuned internal data structures, so in a sense using a Redis list is probably much better than re-implementing one yourself.
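As a small illustration (not an official recipe), this is roughly what the producer and consumer sides could look like with the redis-py client. It assumes a Redis server on localhost:6379 and a made-up key name my_queue:

```python
# Using a Redis list as a simple message queue with the redis-py client.
# Sketch only: assumes a local Redis server and an illustrative key name.
import redis

r = redis.Redis(host="localhost", port=6379)

# Producer: push to the left of the list (like push_front).
r.lpush("my_queue", "order-created:1001")

# Consumer: blocking pop from the right of the list (like pop_tail).
# brpop waits up to 5 seconds for a message; returns (key, value) or None.
item = r.brpop("my_queue", timeout=5)
if item:
    _key, payload = item
    print("consumed:", payload.decode())
```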

On the other hand, using a Redis list as a message queue has some drawbacks, such as:

Message persistence: Redis is an in-memory database. Although it has two persistence mechanisms, AOF and RDB, these are only auxiliary tools and neither is fully reliable. When the Redis server goes down, some data will be lost, which is unacceptable for many businesses.

Hot key performance: whether you cluster with Codis or Twemproxy, reads and writes for a given queue all end up on the same Redis instance, so the problem cannot be solved by adding capacity. If the concurrent reads and writes on one list are very high, you get an unresolvable hot key that can seriously stress the system.

No ACK mechanism: every rpop permanently deletes the message from the list. If a consumer fails to process it, there is no way to recover that message. You might argue that the consumer could push the message back onto the queue on failure, but that only works in the ideal case; in extreme situations the consumer process dies outright, for example via kill -9, a panic, or a coredump.

No support for multiple subscribers: a message can only be consumed by one consumer, and after rpop it is gone. Suppose the queue stores application logs: for the same message, the monitoring system needs to consume it for alerting, the BI system needs to consume it to build reports, and link tracing needs to consume it to draw call relationships. The Redis list cannot support this scenario.

No re-consumption: a message disappears after rpop. If a consumer program is halfway through processing and a bug is found in the code, there is no way to fix the code and start over.

For now, the first problem (persistence) seems manageable. Many companies have teams doing secondary development on top of RocksDB or LevelDB to build KV stores that speak the Redis protocol. These stores are not really Redis anymore, but they behave almost the same. They solve data persistence, but none of the other pitfalls above.

Redis 5.0 added the stream data type, a data structure designed specifically for message queues. It borrows a lot from Kafka's design, but still has plenty of problems. So why not go straight into the world of Kafka?

Three, Kafka

As you can see from the above, a true messaging middleware is more than just a queue. Especially when it carries a large share of a company's business, there are very demanding requirements on its functional completeness, throughput, stability, and scalability. Enter Kafka, a system designed specifically to be messaging middleware.

The Redis list has many shortcomings, but if you think about it carefully, they boil down to two points:

The hot key problem cannot be solved, that is, the performance problem cannot be solved by adding machines;

Data is deleted: it is gone after rpop, so you cannot support multiple subscribers, you cannot start over, and you cannot ack.

These two issues are at the heart of Kafka’s mission.

The inherent problem with hot keys is that the data is concentrated on one instance, so find a way to spread it across multiple machines. To this end, Kafka proposed the concept of partition. A queue (a list in Redis) corresponds to a topic in Kafka. Kafka breaks up a topic into partitions, and each partition can be distributed across different machines, thus spreading the stress of a single machine over multiple machines. So topic is more of a logical concept in Kafka, where the actual storage units are partitions.

A Redis list could do the same, but it requires extra logic in the business code. For example, you could create n lists, key1, key2, ..., keyN; the producer pushes to a different key each time, and consumers consume from all n lists key1 through keyN at once. That would achieve the same effect as Kafka's multiple partitions. So as you can see, a partition is a very plain and simple concept for spreading requests across multiple machines.
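Here is a rough sketch of that manual sharding idea with redis-py. The key names, partition count, and routing function are all made up for illustration:

```python
# Emulating Kafka-style partitioning on top of Redis lists, as described above.
# Sketch only: key names and partition count are invented for illustration.
import zlib
import redis

r = redis.Redis()
N_PARTITIONS = 4
KEYS = [f"my_queue:{i}" for i in range(N_PARTITIONS)]

def produce(user_id: str, payload: str):
    # Route by a stable hash so messages of one user always land in the same list.
    key = KEYS[zlib.crc32(user_id.encode()) % N_PARTITIONS]
    r.lpush(key, payload)

def consume_once(timeout: int = 5):
    # A consumer can block on all partition keys at once; brpop returns the
    # first message available from any of the listed keys, or None on timeout.
    item = r.brpop(KEYS, timeout=timeout)
    if item:
        key, payload = item
        print(f"consumed from {key.decode()}: {payload.decode()}")

produce("user-42", "hello")
consume_once()
```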

The other big problem with the Redis list is that rpop deletes data. Kafka's solution is simple: don't delete it. Kafka solves the problem with cursors instead.

Unlike the Redis list, a Kafka topic (actually a partition) stores data as a one-way, append-only queue; new data is always appended to the tail. Kafka also maintains a cursor that starts at the beginning and always points to the index of the next message to be consumed. Each time a message is consumed, the cursor advances by one. In this way Kafka achieves the same first-in, first-out semantics as the Redis list, but it only updates the cursor and never deletes the data.

The benefits of this design are numerous, especially in terms of performance, since sequential writes have always been the way to maximize disk bandwidth. But let’s focus on the functional advantages of the cursor design.

First, an ACK mechanism for messages becomes possible. Since messages are not deleted, Kafka can wait for the consumer to explicitly confirm that a message has been consumed before updating the cursor. As long as Kafka persists the cursor position, even if consumption fails and the process crashes, the message can be consumed again after recovery.

Second, group consumption can be supported:

Here we need to introduce the concept of a consumer group, which is very simple: a consumer group is essentially a set of cursors. For the same topic, different consumer groups have their own cursors: the monitoring group's cursor points to #2, the BI group's points to #4, the trace group's points to #10000, and so on. The cursors of different consumer groups are isolated from one another and do not interfere.

By introducing the concept of consumer groups, it is very easy to support the simultaneous consumption of a topic by multiple businesses, which is known as 1-N “broadcasts,” where a message is broadcast to N subscribers.

Finally, re-consumption is also easy with a cursor. Because the cursor only records the current consumption position, re-consuming is just a matter of changing the cursor's value. You can reset the cursor to any position you want, for example to zero to start consuming from scratch, or to the very end to skip all existing data.
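Here is a toy model of this cursor design in plain Python. It is only meant to illustrate the semantics (independent cursors per group, ack by advancing the cursor, re-consumption by seeking), not how Kafka actually persists anything:

```python
# A toy model of Kafka's cursor idea: messages are never deleted on consume;
# each consumer group just keeps its own offset into an append-only log.
log = []                       # append-only message log (one partition)
cursors = {}                   # consumer group -> next offset to read

def produce(msg):
    log.append(msg)

def poll(group):
    offset = cursors.get(group, 0)
    if offset >= len(log):
        return None            # nothing new for this group
    return offset, log[offset]

def commit(group, offset):
    cursors[group] = offset + 1   # "ack": advance the cursor only after success

def seek(group, offset):
    cursors[group] = offset       # re-consume from any position

produce("a"); produce("b"); produce("c")
print(poll("monitor"))   # (0, 'a') - the monitor group reads from the start
commit("monitor", 0)
print(poll("bi"))        # (0, 'a') - the BI group has its own independent cursor
seek("monitor", 0)       # rewind and consume again
```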

So you can see that Kafka is a qualitative leap over Redis's doubly linked list, not only in terms of performance, but also in terms of functionality.

Take a look at a simple architecture diagram for Kafka:

From this figure we can see that a topic is a logical concept rather than a physical entity. A topic contains multiple partitions, which are distributed across multiple machines. In Kafka such a machine is called a broker; a broker in a Kafka cluster corresponds to an instance in a Redis cluster. For a given topic, multiple different consumer groups can consume it at the same time, and a consumer group can run multiple consumer instances in parallel to increase consumption speed.

However, it is important to note that a partition can only be consumed by one consumer instance in the consumer group. In other words, if there are multiple consumers in a consumer group, two consumers cannot consume a partition at the same time.

Why is that? If multiple consumers consumed the same partition, then even though Kafka hands out the data in order, there would be no guarantee that the messages reach the consumers in that order, so message ordering could not be guaranteed. Therefore a partition can only be consumed by one consumer within a group. The cursors of each consumer group in Kafka can be represented by a data structure like this:

{
  "topic-foo": {
    "groupA": {
      "partition-0": 0,
      "partition-1": 123,
      "partition-2": 78
    },
    "groupB": {
      "partition-0": 85,
      "partition-1": 9991,
      "partition-2": 772
    }
  }
}

If Kafka only moves the cursor on consumption and never deletes data, then over time the data will fill up the disk. This is where Kafka's retention mechanism comes in, i.e. message expiration, similar to expire in Redis.

If you set a Redis list to expire after one minute, Redis deletes the whole list after one minute. Kafka, however, does not delete the whole topic (partition); it only deletes the messages in the partition that have expired. The good news is that a Kafka partition is a one-way queue, so messages are produced into it in order. Thus, whenever expired messages need deleting, they can simply be deleted from the head.

This sounds simple, but think about it for a moment and problems appear. Suppose all messages of topicA-partition-0 are written to a single file, say topicA-partition-0.log. To simplify, assume the producer writes one message per line in this file; it will soon grow to 200 GB. Now you are told that everything before line x in this file has expired. How do you delete it? It is very hard to do: it is like deleting the first n elements of an array, which requires moving all the following elements forward, a huge amount of CPU copying. Even for a 10 MB file this would be a very expensive deletion, let alone 200 GB.

Therefore, Kafka splits things further when actually storing a partition. The data of topicA-partition-0 is not written to one single file but to multiple segment files. Suppose the segment file size is set to 100 MB: once 100 MB have been written, a new segment file is created and subsequent messages go into it, just as our business systems rotate log files. The advantage is that when all the messages in a segment have expired, the whole file can simply be deleted. And since the messages within a segment are ordered, you only need to check whether the last message in the segment has expired.
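The following toy sketch illustrates that rolling-and-retention idea. The size limit, retention window, and in-memory representation are invented for illustration; real Kafka of course works on files:

```python
# A toy sketch of segment-based retention: a partition is a list of segments;
# when the newest segment exceeds a size limit we roll a new one, and
# retention deletes whole expired segments from the head instead of rewriting
# one huge file. Names and limits are made up for illustration.
import time

SEGMENT_MAX_MESSAGES = 3
RETENTION_SECONDS = 7 * 24 * 3600

segments = [{"base_offset": 0, "messages": [], "last_append_ts": time.time()}]

def append(msg):
    seg = segments[-1]
    if len(seg["messages"]) >= SEGMENT_MAX_MESSAGES:
        next_offset = seg["base_offset"] + len(seg["messages"])
        seg = {"base_offset": next_offset, "messages": [], "last_append_ts": 0}
        segments.append(seg)     # roll a new segment, like creating 18234.log
    seg["messages"].append(msg)
    seg["last_append_ts"] = time.time()

def apply_retention(now=None):
    now = now or time.time()
    # Only whole, non-active segments are deleted, and only when their newest
    # message is already older than the retention window.
    while len(segments) > 1 and now - segments[0]["last_append_ts"] > RETENTION_SECONDS:
        segments.pop(0)          # cheap: drop the whole file, no data copying
```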

  1. Data lookup in Kafka

A partition of a topic is a logical array consisting of multiple segments, as shown in the following figure:

This raises the question, if I reset the cursor to an arbitrary position, such as message 2897, how do I read the data?

Based on the file structure above, you can see that we need to determine two things in order to read the corresponding data:

Which segment file is the 2897 message in?

Where is message 2897 in the segment file?

To answer these two questions, Kafka uses a very clever design. First, each segment file is named after the offset of the first message it contains: the first file is 0.log, and when it is full and the next message's offset is, say, 18234, a new segment file named 18234.log is created, then 39712.log, and so on:

  • /kafka/topic/order_create/partition-0
    • 0.log
    • 18234.log #segment file
    • 39712.log
    • 54101.log

When we want to find which segment the message with offset x is in, we only need a binary search over the file names. For example, the message with offset 2897 is clearly in the segment file 0.log (since 2897 < 18234).

After locating the segment file, the next problem is finding where the message sits inside that file, i.e. its byte position. Scanning from the beginning of the file every time would be unacceptably slow, so Kafka's solution is index files.

Much like a MySQL index, Kafka creates an index file for each segment file. The key is the message offset, and the value is the byte position of that message within the segment file:

offset    position
0         0
1         124
2         336

Each segment file corresponds to an index file:

  • /kafka/topic/order_create/partition-0
    • 0.log
    • 0.index
    • 18234.log #segment file
    • 18234.index #index file
    • 39712.log
    • 39712.index
    • 54101.log
    • 54101.index

With the index file, we can get the exact location of a message and read it directly. Let’s go through the process again:

When querying the message with offset x:

Use binary search over the segment file names to determine that it lives in y.log;

Read the index file y.index to find the byte position of message x within y.log;

Read y.log at that position to get the message data.

With this file organization, we can read and fetch any message very quickly in Kafka. However, this leads to another problem. If the number of messages is very large, adding a record to the index file for each message can waste a lot of space.

A simple calculation: if an index record is 16 bytes (8 for the offset + 8 for the position), 100 million messages need 16 × 10^8 bytes = 1.6 GB. A large company's Kafka collects far more than 100 million logs a day, perhaps tens or hundreds of times more, so the index files would take up a lot of storage. Kafka therefore chooses a "sparse index" as a trade-off.

A sparse index means that not every message gets an entry in the index file. Instead, an entry is recorded only every so many messages, for example one offset-position pair every 10 messages:

offset    position
0         0
10        1852
20        4518
30        6006
40        8756
50        10844

With this scheme, when we look up the message with offset x we may not find its exact position, but binary search quickly gives us the closest preceding indexed message, and we then scan forward a few messages to reach the one we want.

For example, to find the message with offset 33, binary search on the table above locates the entry for offset 30; we then read forward from that position in the log file and reach offset 33 three messages later. This is a compromise between performance and storage space; many systems face the same choice of trading time for space or space for time.
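Putting the pieces together, here is an illustrative sketch of the whole lookup using the numbers from the examples above. The data layout is simplified, and the byte positions are the ones from the sparse-index table:

```python
# A sketch of the two-step lookup described above: binary-search the segment
# base offsets (taken from the file names), then binary-search that segment's
# sparse index and scan forward to the exact message.
import bisect

segment_bases = [0, 18234, 39712, 54101]          # from 0.log, 18234.log, ...

# Sparse index of segment 0: (message offset, byte position in 0.log),
# one entry every 10 messages, as in the table above.
sparse_index_0 = [(0, 0), (10, 1852), (20, 4518), (30, 6006), (40, 8756), (50, 10844)]

def find_segment(offset):
    # Rightmost base offset <= target, e.g. 2897 -> the segment starting at 0.
    i = bisect.bisect_right(segment_bases, offset) - 1
    return segment_bases[i]

def find_position(offset, index):
    # Rightmost index entry <= target; the real message is at most a few
    # entries past this byte position, so we scan forward from there.
    keys = [o for o, _ in index]
    i = bisect.bisect_right(keys, offset) - 1
    return index[i]

print(find_segment(2897))                 # 0 -> read 0.log
print(find_position(33, sparse_index_0))  # (30, 6006) -> scan forward to offset 33
```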

At this point, we should have a clear picture of Kafka's overall architecture. However, in this walkthrough I have deliberately omitted another very important aspect of Kafka: its high-availability design. That part is somewhat obscure and introduces a lot of distributed-systems complexity, which would get in the way of understanding Kafka's basic model. It is the subject of the next section.

  2. Kafka high availability

High availability (HA) is critical for an enterprise's core systems. As the business grows, so does the cluster, and in large clusters faults are the norm: hardware and networks are unstable. When some nodes cannot work properly for whatever reason, a highly available system tolerates the fault and keeps providing service. For stateful services, tolerating a partial failure essentially means tolerating data loss (not necessarily permanent loss, but data that cannot be read for some period of time).

The simplest, and essentially the only, way for a system to tolerate lost data is to keep backups: copies of the same data on multiple machines, also known as redundancy or multiple replicas. To this end, Kafka introduces the leader-follower concept. Each partition of a topic has a leader, and all reads and writes for that partition go through the broker where the leader resides. The partition's data is replicated to other brokers, whose corresponding partitions are the followers:

When producing messages, the producer sends the messages directly to the partition leader, who writes the messages to the partition’s log and waits for followers to pull the data for synchronization. Specific interactions are as follows:

In the figure above, the timing of the ack sent back to the producer is critical, because it directly determines the availability and reliability of the Kafka cluster.

Option 1: ack as soon as the producer's data reaches the leader and is successfully written to the leader's log.

Advantages: no need to wait for data synchronization to complete, so it is fast, with high throughput and high availability. Disadvantages: if the leader dies before the followers finish synchronizing, data may be lost, so reliability is lower.

Option 2: ack only after the followers have synchronized the data.

Advantages: even if the leader dies, the followers have complete data, so reliability is high. Disadvantages: waiting for all followers to synchronize is slow, performance is worse, production requests time out more easily, and availability is lower.

When to ack is configurable in Kafka according to the actual application scenario.
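For reference, this is roughly how that trade-off is exposed to applications, sketched with the kafka-python client. It assumes that library is installed, a broker at localhost:9092, and a made-up topic name:

```python
# How the ack timing trade-off surfaces to applications, sketched with the
# kafka-python client (assumed installed, with a broker on localhost:9092).
from kafka import KafkaProducer

# acks=1    : the leader acks as soon as it has written the message locally
#             (fast, but data may be lost if the leader dies before followers sync).
# acks='all': the leader acks only after the in-sync replicas have the message
#             (more reliable, higher latency).
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")

future = producer.send("order-events", b"order-created:1001")
record_metadata = future.get(timeout=10)   # raises if the broker never acks
print(record_metadata.topic, record_metadata.partition, record_metadata.offset)
producer.flush()
```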

In reality, Kafka's data synchronization process is far more complex. This article focuses on Kafka's core principles, so the many technical details involved in data synchronization (high watermark, leader epoch, and so on) are not expanded here. Finally, a panoramic view of Kafka:

Kafka improves throughput by introducing partitions, allowing a topic to be spread across multiple brokers. The cost of multiple partitions is that global ordering at the topic level cannot be guaranteed; scenarios that need it must use a single partition.

Internally, each partition is stored as multiple segment files. New messages are appended to the latest segment log file, and a sparse index records the position of messages within the log file so that messages can be read quickly. When the data in a segment expires, the whole segment file is deleted.

For high availability, each partition has multiple replicas, one of which is the leader and the rest followers, spread across different brokers. All reads and writes for a partition go through the broker where the leader resides; followers just pull the leader's data periodically for synchronization. When the leader dies, a follower that has kept up with the leader is elected as the new leader to keep serving, which greatly improves availability.

On the consumer side, Kafka introduces consumer groups. Consumer groups consume topics independently of one another, but within a group a partition can only be consumed by one consumer. By recording cursors, consumer groups support the ACK mechanism, re-consumption, and other features.

Apart from the actual messages recorded in the segments, almost all metadata is stored in a global ZooKeeper.

  3. Advantages and disadvantages

(1) Advantages. Kafka has many advantages:

High performance: around 1,000,000 TPS in single-machine tests;

Low latency: both production and consumption have low latency, and end-to-end (E2E) latency is low in a healthy cluster;

High availability: guaranteed by the replica + ISR + leader election mechanisms;

Mature tool chain: complete monitoring and O&M management solutions;

Mature ecosystem: Kafka Streams is essential for big-data scenarios.

(2) Disadvantages

Partitions cannot be scaled flexibly: all reads and writes for a partition go to the broker where the partition leader resides, so if that broker is overloaded, adding a new broker does not solve the problem.

High cost of capacity expansion: new brokers in the cluster only handle new topics; to share the load of existing topic partitions, partitions have to be migrated manually, which consumes a lot of cluster bandwidth.

Rebalancing: can cause repeated consumption, slow down consumption, and increase end-to-end latency.

Too many partitions degrade performance significantly: ZooKeeper comes under heavy pressure, and too many partitions on a broker degrade sequential disk writes into nearly random writes.

After you understand Kafka's architecture, ask yourself: why is scaling Kafka so troublesome? In essence, it is the same problem as scaling a Redis cluster. When a hot key appears in a Redis cluster and one instance cannot keep up, adding machines does not help, because the hot key still lives on the original instance.

Kafka scales in two ways: adding brokers, and adding partitions to a topic. Adding partitions to a topic is like sharding a user order table into 1024 sub-tables by user ID: if that later proves insufficient, increasing the number of sub-tables is painful, because the hash of user_id now maps to different tables and old data can no longer be found, so you have to stop the service, migrate the data, and then go back online. Kafka faces the same problem when adding partitions to a topic. For example, in scenarios where messages carry a key, the producer must guarantee that messages with the same key land on the same partition; but once the number of partitions grows, hashing on the key, e.g. hash(key) % partition_num, produces different results, so the same key no longer maps to the same partition. You could implement a custom partitioner on the producer that guarantees the same key always lands on the same partition no matter how many partitions are added, but then the newly added partitions receive no data at all.
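A tiny illustration of why the key-to-partition mapping breaks when partitions are added (the hash function and numbers are arbitrary):

```python
# Why adding partitions breaks the key -> partition mapping: the same key
# hashes to a different partition once partition_num changes.
import zlib

def partition_for(key: str, partition_num: int) -> int:
    return zlib.crc32(key.encode()) % partition_num

key = "user-42"
print(partition_for(key, 4))   # partition chosen while there are 4 partitions
print(partition_for(key, 5))   # usually a different partition after expansion,
                               # so old messages for this key stay where they were
```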

One problem you can see is that Kafka’s core complexity is almost all in storage. How to fragment data, how to store data efficiently, how to read data efficiently, how to ensure consistency, how to recover from errors, how to expand and rebalance…

The shortcomings above boil down to one word: scalability. The ultimate goal is a system whose problems can be solved simply by adding machines. Pulsar claims to be the distributed messaging and streaming platform for the cloud-native era, so let's take a look at what Pulsar does.

Four, Pulsar

The core complexity of Kafka is its storage. High-performance, highly available, low-latency distributed storage that supports rapid expansion is not only what Kafka needs, it is the common pursuit of all modern systems. And Apache happens to have a project built specifically for log storage: BookKeeper!

With a dedicated storage component in place, all that remains to build a messaging system is how to implement the messaging features on top of that storage. Pulsar is exactly such a "compute-storage separated" messaging system:

Pulsar uses BookKeeper as its storage service; the rest is the computing layer. This is actually a very popular architecture, and a clear trend: a lot of new storage systems use this kind of storage-compute separation. For example, in TiDB the underlying storage is actually TiKV, a KV store, while TiDB itself is the higher computing layer that implements the SQL-related functionality. Another example is the many "persistent Redis" products, most of which rely on RocksDB underneath for KV storage and then implement the Redis data structures on top of that KV model.

In Pulsar, a broker means the same thing as in Kafka: a running instance of Pulsar. Unlike Kafka, however, Pulsar's broker is a stateless service. It is simply an "API layer" that handles a large number of user requests: when a user's message arrives, it calls the BookKeeper interface to write the data, and when a user wants to read messages, it looks the data up in BookKeeper. Of course the broker itself also does a lot of caching, and it relies on ZooKeeper to hold a lot of metadata relationships.

Since the broker itself is stateless, this layer can be scaled very easily, especially in a Kubernetes environment where it is practically a mouse click. Message persistence, high availability, fault tolerance, and storage expansion are all handled by BookKeeper.

But just like the law of conservation of energy, system complexity is conserved. The technical complexity needed for storage that is both high-performance and reliable does not vanish into thin air; it merely moves from one place to another. It is like writing business logic: if the product manager comes up with 20 different business scenarios, there will be at least 20 ifs and elses, and no matter which design patterns and architecture you use, those ifs and elses are not eliminated, only moved from one file to another, from one object to another. So all that complexity necessarily lives inside BookKeeper, and it is even more complex than Kafka's storage implementation.

However, one benefit of Pulsar's storage-compute separation is that when we study Pulsar we get a relatively clean boundary, that is, a separation of concerns. As long as you understand the API semantics that BookKeeper offers the broker layer above it, you can understand Pulsar without understanding BookKeeper's internal implementation.

Then ask yourself: since Pulsar's broker layer is a stateless service, can we produce data for a topic to any broker at random?

It might seem so, but the answer is no. Why? If a producer could send three messages A, B, and C to any brokers, how could we guarantee they are written to BookKeeper in the order A, B, C? Only by sending A, B, and C to the same broker can the write order be kept the same as the send order.

But then we seem to be back to the same problem as Kafka: what if a topic is too heavy for one broker to handle? So Pulsar, like Kafka, has partitions. A topic can be split into multiple partitions, and to keep messages within each partition ordered, production for each partition must go through the same broker.

This may look the same as Kafka, where each partition maps to a broker, but it is quite different. To guarantee sequential writes to a partition, both Kafka and Pulsar require that write requests be sent to the broker responsible for that partition, which is what ensures ordering. The difference is that Kafka also stores the messages on that broker, while Pulsar stores them in BookKeeper. The benefit is that when a Pulsar broker fails, the affected partitions can be switched to another broker immediately, as long as only one broker globally holds write permission for topic-partition-x. Essentially this is just a transfer of ownership; no data is moved at all.

When a write request for a partition reaches its broker, the broker calls the interfaces provided by BookKeeper to store the message. Like Kafka, Pulsar also stores a partition as segments.

In BookKeeper the storage unit is called a ledger; each segment of a Pulsar partition is stored as a ledger in BookKeeper. A ledger can be likened to a file on a file system: in Kafka a segment is written to a file such as xxx.log, whereas Pulsar stores each segment as a ledger in BookKeeper.

Each node in a BookKeeper cluster is called a bookie (don't read too much into it, it's just a name; after writing that much code, the author is surely allowed a playful name). To instantiate a BookKeeper writer, you need to provide three parameters:

Number of nodes n: the number of bookies in the BookKeeper cluster;

Number of copies m: a ledger will be written to m of the n bookies, i.e. m replicas;

Confirmed-write count t: each write to the ledger (sent to the m bookies concurrently) must receive t acks before it returns success.

Based on these three parameters, BookKeeper handles the complex data synchronization for us; we do not need to worry about replicas and consistency, we just call BookKeeper's append interface and let it do the rest.

As shown in the figure above, the partition is split into multiple segments, each of which is written to three of the four bookies. For example, segment 1 is written to bookies 1, 2, and 4, and segment 2 to bookies 1, 3, and 4, and so on.

This is equivalent to spreading the segments of a Kafka partition evenly across multiple storage nodes. What does that buy us? In Kafka a partition keeps writing to the file system of one and the same broker, and when the disk fills up you have to go through the cumbersome work of expansion and data migration. In Pulsar, because different segments of a partition can be stored on different bookies, when heavy writes exhaust the disks of the existing bookies we can quickly add machines to solve the problem: new segments simply pick the most suitable bookies to write to, and all that needs to be remembered is the mapping between segments and bookies.
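Here is a toy sketch of that placement idea. The bookie names, free-space numbers, and selection rule are all invented for illustration; real BookKeeper handles placement internally:

```python
# A toy sketch of segment placement in a storage-compute separated design:
# each new segment independently picks m of the n bookies (here by free space),
# and the partition only records the segment -> bookies mapping.
bookies = {"bookie1": 500, "bookie2": 80, "bookie3": 300, "bookie4": 450}  # free GB

M_COPIES = 3
segment_locations = {}   # segment id -> list of bookies holding its copies

def place_segment(segment_id):
    # Pick the m bookies with the most free space; a newly added bookie with a
    # big empty disk naturally starts receiving new segments right away.
    chosen = sorted(bookies, key=bookies.get, reverse=True)[:M_COPIES]
    segment_locations[segment_id] = chosen
    for b in chosen:
        bookies[b] -= 1   # pretend each segment uses 1 GB

place_segment("topicA-partition0-segment1")
place_segment("topicA-partition0-segment2")
print(segment_locations)
```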

Distributing partitions across the BookKeeper nodes at segment granularity makes storage expansion very easy. This is also what Pulsar keeps advertising as the advantage of its storage-compute separation architecture:

Brokers are stateless and expand freely.

Partitions are distributed across the entire BookKeeper cluster by segment; with no single point, they can be expanded easily.

When a bookie fails, thanks to the multiple replicas, one of the remaining copies can be read instead, so service continues uninterrupted and availability stays high.

In fact, if you look at Pulsar after understanding Kafka's architecture, the core of Pulsar is its use of BookKeeper plus some metadata storage. And the clean storage-compute separation helps us separate concerns, so we can learn it quickly.

Consumption model

Another design where Pulsar goes further than Kafka is the abstraction of the consumption model, called a subscription. Through this layer of abstraction, a variety of consumption scenarios can be supported. Compare this with Kafka again: Kafka has only one consumption model, one or more partitions per consumer; one partition shared by multiple consumers is impossible. Through subscriptions, Pulsar currently supports four consumption modes:

A Pulsar subscription can be thought of as a Kafka consumer group, but it goes one step further: it also sets the consumption type of that "consumer group":

Exclusive: only one consumer in the group is allowed to consume; the others cannot even connect to Pulsar;

Failover: every consumer in the group can connect to the broker of each partition, but only one consumer actually consumes the data. When that consumer crashes, another consumer is chosen to take over;

Shared: All consumers in a consumer group can consume all partitions in a topic. Messages are distributed in a round-robin manner.

Key-shared: All consumers in a consumer group can consume all partitions in a topic, but messages with the same key are guaranteed to be sent to the same consumer.

These consumption modes cover a variety of business scenarios, and users can choose the one that fits their situation. Through this abstraction, Pulsar supports both the queue consumption model and the stream consumption model, and countless others (as long as someone submits a PR), which is what Pulsar calls its unified consumption model.
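As a small illustration, this is roughly how a subscription type is chosen with the pulsar-client Python library. It assumes that library is installed, a broker at pulsar://localhost:6650, and made-up topic and subscription names:

```python
# Choosing a subscription type with the pulsar-client Python library.
# Topic and subscription names are made up for illustration.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# The same subscription name with a different consumer_type gives the
# Exclusive / Failover / Shared / KeyShared behaviours described above.
consumer = client.subscribe(
    "persistent://public/default/order-events",
    subscription_name="bi-report",
    consumer_type=pulsar.ConsumerType.Shared,
)

msg = consumer.receive(timeout_millis=5000)  # raises if nothing arrives in 5s
consumer.acknowledge(msg)    # an explicit ack moves this subscription's cursor
client.close()
```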

Underneath the consumption-model abstraction sits different cursor-management logic: how to ack, how to move the cursor, how to quickly find the next message that needs to be retried, and so on. These are all technical details, but this layer of abstraction hides them so you can focus on the application.

Five, the storage and computation separation architecture

In fact, technology always develops in a spiral; you will often find that the newest direction circles back to a technology from 20 years ago.

Twenty years ago, limited by the hardware of ordinary computers, the storage of large amounts of data was handled by centralized storage such as NAS. But that approach had many limitations: it required dedicated hardware, and the biggest problem was that it was hard to scale to accommodate massive amounts of data.

Databases were likewise dominated by Oracle-on-minicomputer setups. But as the Internet developed and data volumes grew, Google later introduced a distributed storage scheme built from ordinary computers: any machine can act as a storage node, and these nodes work together to form a larger storage system. Its open-source counterpart is HDFS.

The mobile Internet has further increased data volumes, and the spread of 4G and 5G has made users very sensitive to latency. Storage that is reliable, fast, and scalable has gradually become a hard requirement for enterprises. And as time goes on, Internet applications concentrate ever more traffic, so the demands of large enterprises keep getting stronger.

As a result, reliable distributed storage as a piece of infrastructure keeps improving. These systems share the same goal: letting you use them like a file system, while offering high performance, high reliability, automatic fault recovery, and other features. However, we still face the limits of the CAP theorem: of linearizable consistency (C), availability (A), and partition tolerance (P), only two can be satisfied at the same time. So there is no perfect storage system; there is always some "shortcoming". What we have to do is pick the appropriate storage infrastructure for each business scenario and build the upper-layer applications on top of it. That is the logic behind Pulsar, behind NewSQL systems such as TiDB, and the basic logic of future large distributed systems, the so-called "cloud native" approach.