Abstract: Pulsar, a cloud-native distributed messaging and streaming platform, has been drawing more and more attention and is increasingly seen as a possible replacement for Kafka.

This article is shared from "MRS Pulsar: A New Release of the Next-Generation Distributed Messaging and Streaming Platform!", by Lothar.

Pulsar’s background

Apache Pulsar is a publish-subscribe messaging system built on a cloud-native architecture that separates computing from storage. Pulsar became a top-level project of the Apache Software Foundation (ASF) in September 2018. Over the past two years, as the community has grown and many enterprises have adopted and contributed to it, Pulsar has drawn more and more attention as a cloud-native distributed messaging and streaming platform, and is increasingly seen as a possible replacement for Kafka.

Pulsar vs. Kafka

The biggest architectural difference between Pulsar and Kafka is that in Kafka the Broker both sends and receives messages and persists them. Data is stored on the Broker's local file system and managed by the Broker. This means that data storage and message processing are coupled.

According to the Kafka website, Kafka relies heavily on the file system for storing and caching messages. When the Broker receives a message, it appends it to a log on the local disk. As a result, the mapping between partitions and Brokers is relatively fixed, and data is migrated only when partitions are reassigned. The Leader of a Partition is chosen from the nodes that hold that Partition's replicas, and it handles produce and consume requests.

Pulsar uses an architecture that separates computing from storage, which is the main reason Pulsar is called a cloud-native platform. To manage persistent data, Pulsar relies on Apache BookKeeper, a scalable, fault-tolerant log storage service that provides low-latency reads and writes with strong durability.

* Quoted from the Pulsar website: pulsar.apache.org/docs/en/con…

After the Broker receives a request, the data is actually distributed across and stored in the BookKeeper service. In the physical storage model, the data for a Topic or Partition is not tied to a single Bookie instance.

Pulsar divides the distributed log into multiple segments, each corresponding to a Ledger in BookKeeper. Unlike Kafka, which stores the log for a Partition in a fixed directory, Pulsar can place the segments of the same Topic or Partition on different Bookies.
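The segment model above can be illustrated with a small, self-contained sketch. It is not BookKeeper's actual placement logic: the round-robin rotation, the `place_segments` helper, the bookie names, and the ensemble size of 2 are all assumptions made for illustration.

```python
def place_segments(num_segments, bookies, ensemble_size=2):
    """Map each segment (ledger) of one partition to a set of bookies.

    Assumption: a simple rotating choice stands in for the real
    placement policy, which is more elaborate.
    """
    if ensemble_size > len(bookies):
        raise ValueError("not enough bookies for the requested ensemble size")
    placement = {}
    for seg in range(num_segments):
        # Each segment starts at a different bookie, so one partition's
        # data ends up spread over the whole storage cluster.
        placement[seg] = [bookies[(seg + i) % len(bookies)]
                          for i in range(ensemble_size)]
    return placement


# Hypothetical cluster with three bookies and a partition of four segments.
placement = place_segments(4, ["bookie-1", "bookie-2", "bookie-3"])
for seg, ensemble in placement.items():
    print(f"segment {seg} -> {ensemble}")
```

In this sketch no single bookie holds the whole partition, which is the contrast with Kafka's fixed per-partition directory.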

Advantages of Pulsar

Flexible scaling

Many Kafka users have had experiences like these:

  • Disk space runs out, and the only options are to shorten the data retention time (TTL) or to add machines and then migrate partitions to the new Brokers.

  • Data is distributed unevenly between topics or partitions, so usage is uneven between nodes or disks. Some disks are full, while others have plenty of free space.

  • A Broker machine is faulty, and it can be taken offline only after its data has been migrated to other nodes.

Pulsar’s separation of storage and computing naturally avoids these problems. The Pulsar Broker itself is stateless, so when one Broker fails, another Broker can immediately take over the corresponding Topic without migrating data. BookKeeper's distributed log keeps data balanced among storage nodes and prevents I/O from concentrating on one node because a single Partition or Topic holds too much data.

When the cluster needs to be expanded, the Broker can immediately become aware of the newly added Bookie and store the newly written data into the newly added Bookie.
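A minimal sketch of this expansion behavior follows. The least-loaded selection rule, the `pick_ensemble` helper, and the bookie names and load figures are assumptions for illustration; the point is only that existing segments stay where they are and new segments can land on the new Bookie right away.

```python
def pick_ensemble(load, ensemble_size=2):
    """Pick the bookies currently holding the fewest segments.

    Assumption: "least loaded first" is a stand-in policy, not the
    actual BookKeeper placement algorithm.
    """
    return sorted(load, key=lambda b: (load[b], b))[:ensemble_size]


# Segments already stored per bookie before the expansion (made-up numbers).
load = {"bookie-1": 3, "bookie-2": 3}

# Expansion: a new, empty bookie joins. Nothing is migrated.
load["bookie-3"] = 0

new_segment_ensembles = []
for _ in range(3):  # three newly written segments
    ensemble = pick_ensemble(load)
    for bookie in ensemble:
        load[bookie] += 1
    new_segment_ensembles.append(ensemble)

print(new_segment_ensembles)
print(load)
```

Every new segment in this run includes the new Bookie, while the segments written before the expansion are untouched.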

Multi-tenancy

In KIP-37, the Kafka community is discussing adding namespaces to implement multi-tenancy, a feature Pulsar already has. Within an enterprise, a message queue service is typically shared by multiple teams, and with Kafka it is sometimes necessary to maintain a separate Kafka cluster for each team. Pulsar can be configured with multiple tenants, and each tenant can have multiple namespaces. The administrator can control access to namespaces and manage their quotas.
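The tenant and namespace hierarchy is visible in Pulsar topic names, which take the form `persistent://tenant/namespace/topic` (or `non-persistent://...`). The sketch below parses such a name; the `TopicName` class, the `parse_topic` helper, and the example tenant and namespace are hypothetical and are not part of any Pulsar client API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicName:
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic(name):
    """Split a fully qualified topic name into its four parts."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unknown or missing domain in {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


# Hypothetical team "team-a" with a namespace "orders".
parsed = parse_topic("persistent://team-a/orders/created")
print(parsed.tenant, parsed.namespace, parsed.topic)
```

Because the tenant and namespace are part of every topic name, several teams can share one cluster while access control and quotas are applied per namespace.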

More flexible subscription models

Kafka's consumption model has two levels. For Kafka consumers that belong to the same Group, the messages they receive are mutually exclusive; that is, a message is processed by only one Consumer in the Group. Across different Groups, each Group receives every message, so messages are shared between Groups.

Pulsar offers a more flexible subscription model:

  • Exclusive:

At any time, only one Consumer in the subscription can consume data from the Topic, and other consumers are not allowed to receive messages.

  • Failover:

When multiple consumers subscribe to the same Topic at the same time, one Consumer is selected as the primary and the others become standby consumers. When the primary Consumer fails, an active/standby switchover occurs: one of the standby consumers becomes the primary and continues consuming messages.

  • Shared:

Similar to a Kafka consumer group, messages are distributed in round-robin fashion to the different consumers, and when one Consumer fails, its messages are redistributed to the other consumers.
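The three modes above (exclusive, primary/standby failover, and shared) can be compared with a toy dispatcher. This is a simplified simulation, not Pulsar's broker logic: the `dispatch` function and the consumer names are hypothetical, and treating the first consumer in the list as the primary is an assumption.

```python
def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} for one subscription."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    received = {c: [] for c in consumers}
    if mode == "exclusive":
        # A second consumer is rejected outright.
        if len(consumers) > 1:
            raise RuntimeError("exclusive subscription already has a consumer")
        received[consumers[0]] = list(messages)
    elif mode == "failover":
        # Assumption: the first consumer is the primary, the rest stand by.
        received[consumers[0]] = list(messages)
    elif mode == "shared":
        # Round-robin: each message goes to exactly one consumer.
        for i, msg in enumerate(messages):
            received[consumers[i % len(consumers)]].append(msg)
    else:
        raise ValueError(f"unknown mode {mode!r}")
    return received


msgs = ["m0", "m1", "m2", "m3"]
shared = dispatch(msgs, ["c1", "c2"], "shared")
failover = dispatch(msgs[:2], ["c1", "c2"], "failover")
# c1 fails: the standby c2 becomes primary and continues with the rest.
after_switch = dispatch(msgs[2:], ["c2"], "failover")
print(shared, failover, after_switch)
```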

Tiered storage

Another attractive feature of Pulsar is that streaming data can be moved to cheaper storage media as it cools. Streaming systems are typically equipped with high-performance SSDs to ensure performance. For Kafka, all messages that need to be retained must reside on expensive SSDs. In some cases, data written some time ago is no longer in active use but must still be archived for a period of time. Pulsar supports offloading this cold data to an offline storage system, so BookKeeper keeps only the hot portion of the data, which saves a lot of storage cost. This feature is clearly valuable, and the Kafka community is also working on it (KIP-405), but at the time of writing it had not yet been implemented.
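The hot/cold split described above can be sketched as follows. The size-budget rule, the `split_hot_cold` helper, and the segment sizes are assumptions for illustration; this is not Pulsar's offloader implementation.

```python
def split_hot_cold(segments, hot_budget):
    """Decide which segments stay hot and which are offloaded.

    segments: list of (segment_id, size) ordered oldest first.
    hot_budget: how much data may stay on the fast tier (same unit as size).
    Assumption: the newest segments stay hot until the budget is used up,
    and everything older is treated as cold.
    """
    hot, cold = [], []
    used = 0
    for seg_id, size in reversed(segments):  # walk from newest to oldest
        if not cold and used + size <= hot_budget:
            hot.append(seg_id)
            used += size
        else:
            cold.append(seg_id)
    hot.reverse()
    cold.reverse()
    return hot, cold


# Made-up segment sizes in MB, oldest first, with a 70 MB hot budget.
segments = [("seg-0", 40), ("seg-1", 40), ("seg-2", 40), ("seg-3", 20)]
hot, cold = split_hot_cold(segments, hot_budget=70)
print("stay on fast storage:", hot)
print("offload to cheap storage:", cold)
```

Only the segments in `hot` would occupy the fast tier in this sketch; the older ones are the candidates for cheaper storage.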

Performance indicators of Pulsar

Both the Kafka and Pulsar communities have run comparative performance tests. In general, Pulsar offers stronger durability than Kafka by default, because Pulsar calls fsync when data is written to disk. The Pulsar community adjusted this setting and ran comparative tests. Some of the results are as follows:

* From the Pulsar community performance test report

When Pulsar's local durability level is set to match Kafka's, the throughput of Pulsar is almost the same as Kafka's.

* From the Pulsar community performance test report

When the number of partitions increases to 2000, Pulsar’s throughput with its default local durability setting is roughly equal to Kafka’s.

For more details, see StreamNative’s benchmarking report: "Benchmarking Pulsar and Kafka: A More Accurate Perspective on Pulsar's Performance".

MRS Pulsar

MRS has released a POC version of Pulsar that lets customers deploy Pulsar services, including the Broker and Bookie roles, with one click. The Web UI supports modifying the Pulsar configuration and starting, stopping, and monitoring the service.

MRS also integrates KoP (Kafka on Pulsar) by default. KoP is an open source plug-in that runs on Pulsar and is compatible with the Kafka protocol. With KoP, a Kafka client only needs to change its connection address to switch to a Pulsar cluster; the application's dependency on the Kafka client does not have to change.

The commercial version of MRS Pulsar is being planned. We will explore more possibilities for Pulsar in cloud applications and further leverage Pulsar's separation of storage and computing to reduce costs, improve resource utilization, and create more value for customers. Stay tuned.
