Translated by Zhai Jia

In the previous article, we described why Apache Pulsar can become an enterprise-class flow and messaging system. Pulsar’s enterprise features include persistent message storage, multi-tenancy, multi-room interconnection, encryption, and security. One question we are often asked is how Apache Pulsar is different from Apache Kafka.

In this series of Pulsar and Kafka comparison articles, we’ll walk you through some of the important concerns in messaging systems, such as robustness, high availability, and high bandwidth and low latency.

The message model is the first thing that users consider when choosing a messaging system. The message model should cover the following three aspects:

  1. Message consumption – how messages are sent and consumed;
  2. Message acknowledgement (ACK) – How to acknowledge a message;
  3. Message save – How long messages are kept, what triggers message deletion and how to delete them;

Message consumption model

In a real-time streaming architecture, messaging can be divided into two types: queues and streams.

The Queue model

The queue model mainly consumes messages in an unordered or shared manner. With the queue model, users can create multiple consumers to receive messages from a single pipe; When a message is sent from the queue, only one (any one) of multiple consumers receives and consumes the message. The implementation of the messaging system determines which consumer actually receives the message.

The queue model is often used in conjunction with stateless applications. Stateless applications don’t care about sorting, but they do need to be able to acknowledge (ACK) or delete single messages, as well as extend the ability to consume parallelism as much as possible. Typical queue-based messaging systems include RabbitMQ and RocketMQ.

The Stream model

In contrast, the flow model requires the consumption of messages to be either strictly ordered or exclusive. For a pipe, with the streaming model, only one consumer will always consume and consume messages. Consumers receive messages sent from the pipe in the exact order in which they are written to the pipe.

Flow models are typically associated with stateful applications. Stateful applications pay more attention to the order of messages and their state. The order in which messages are consumed determines the state of a stateful application. The order of messages affects the correctness of the application’s processing logic.

In microservice-oriented or event-driven architectures, both a queue model and a flow model are required.

Pulsar’s message consumption model

Apache Pulsar abstracts a unified producer-topic-subscription consumption model through subscriptions. Pulsar’s message model supports both a queue model and a flow model.

In Pulsar’s message consumption model, a Topic is a channel for sending messages. Each Topic corresponds to a distributed log in Apache BookKeeper. Each message published by a publisher is stored only once in the Topic; During storage, BookKeeper copies the message and stores it on multiple storage nodes. Each message in a Topic can be used multiple times based on a Consumer’s subscription needs, with each subscription corresponding to a Consumer Group.

Topics are the real source of consumption information. Although messages are only stored once on a Topic, users can consume them in different subscription ways:

  • Consumers are grouped together to consume messages, and each consumer group is a subscription.
  • Each Topic can have different consumer groups.
  • Each group of consumers is a subscription to a topic.
  • Each group of consumers can have its own consumption mode: Exclusive, Failover, or Share.

Through this model, Pulsar combines the queue model and the flow model to provide a unified API interface. This model will not affect the performance of the messaging system, nor will it bring extra overhead, but also provides users with more flexibility, convenient user programs to use the messaging system in the most matching pattern.

Exclusive subscription (Stream model)

As the name implies, in an exclusive subscription, there is one and only one consumer in a consumer group (subscription) consuming messages in a Topic at any one time. Below is an example of an exclusive subscription. In this example, there is an active consumer A-0 with subscription A, and messages m0 through M4 are sent sequentially and consumed by A-0. If another consumer A-1 wants to attach to subscription A, it is not allowed.

Failover (Stream model)

With failover subscriptions, multiple consumers can be attached to the same subscription. However, of all the consumers in a subscription, only one will be selected as the primary consumer for that subscription. Other consumers will be designated as failover consumers.

When the primary consumer disconnects, the partition is reassigned to one of the failover consumers, and the newly assigned consumer becomes the new primary consumer. When this happens, all unacknowledged (ACK) messages are delivered to the new primary consumer. This is similar to Consumer Partition Rebalance in Apache Kafka.

Below is an example of failover subscription. Consumers B-0 and B-1 consume messages by subscribing to B subscriptions. B-0 is the primary consumer and receives all messages. B-1 is the failover consumer, which will take over the consumption if consumer B-0 fails.

Shared subscriptions (Queue Queue model)

With shared subscriptions, the user mounts as many consumers as the application needs behind the same subscription. All messages in a subscription are sent to multiple consumers behind the subscription in a circular distribution, and one message is delivered to only one consumer.

When a consumer disconnects, all messages delivered to it that have not been acknowledged (ACK) are reallocated and organized to be sent to the remaining remaining consumers on that subscription.

Below is an example of a shared subscription. Consumers C-1, C-2, and C-3 all consume messages on the same topic. Each consumer receives about one-third of all messages.

To speed up consumption, users don’t need to increase the number of partitions, just add more consumers to the same subscription.

Choice of three subscription models

Exclusive and failover subscriptions allow only one consumer to use and consume each subscription to a topic. Both patterns use messages in topic partition order. They are best suited for Stream use cases where strict message order is required.

Shared subscriptions allow multiple consumers per topic partition. Each consumer in the same subscription receives only a portion of the message for the topic partition. Shared subscriptions work best with queues that do not require message order and can be extended as many consumers as necessary.

Subscriptions in Pulsar are actually similar to the concept of Consumer groups in Apache Kafka. The operation of creating subscriptions is lightweight and highly scalable, allowing users to create as many subscriptions as their application needs. Different subscription types can also be used for different subscriptions to the same topic. For example, a user can provide a failover subscription of three consumers on the same topic, as well as a shared subscription of 20 consumers, and add more consumers to the shared subscription without changing the number of partitions. The following figure depicts A topic with three subscriptions A, B, and C and illustrates how messages flow from producer to consumer.

In addition to the unified messaging API, since the Pulsar topic partition is actually stored in Apache BookKeeper, it also provides a read API (Reader), similar to the consumer API (but without cursor management for Reader), So that the user has complete control over how the messages in the Topic are used.

Pulsar Message Confirmation (ACK)

Due to the nature of distributed systems, failures can occur when using distributed messaging systems. For example, when a consumer consumes a message from a topic in the messaging system, errors can occur both to the consumer consuming the message and to the message Broker serving the topic partition. The purpose of message acknowledgement (ACK) is to ensure that when such a failure occurs, the consumer can pick up where it left off without losing the message or reprocessing the ACK message. In Apache Kafka, the recovery point is usually called Offset, and the process of updating the recovery point is called message confirmation or commit Offset.

In Apache Pulsar, a specialized data structure, Cursor, is used within each subscription to track the ACK status of each message within the subscription. The cursor is updated each time a consumer confirms a message on the topic partition. Updating the cursor ensures that consumers do not receive messages again.

Apache Pulsar provides two message acknowledgement methods: Individual Ack and Cumulative Ack. With cumulative confirmation, the consumer only needs to confirm the last message it received. All messages (including) providing message ids in the topic partition will be marked as confirmed and will not be passed back to the consumer. Cumulative validation is similar to Offset updates in Apache Kafka.

Apache Pulsar supports single validation of messages, also known as selective validation. Consumers can individually confirm a message. Confirmed messages will not be redelivered. The following figure illustrates the difference between single and cumulative acknowledgments (messages in the gray box are acknowledged and not redelivered). In the upper half of the graph, which shows an example of cumulative confirmation, messages prior to M12 are tagged as acked. In the bottom half of the figure, it shows an example of doing acking alone. Confirm only messages M7 and M12 – in case of consumer failure, all messages except M7 and M12 will be retransmitted.

Consumers of exclusive or failover subscriptions can make single and cumulative acknowledgments of messages; Consumers of shared subscriptions are only allowed to make single acknowledgements to messages. The ability to send a single confirmation message provides a better experience for handling consumer failures. For some applications, processing a message can take a long time or be very expensive, and it is important to prevent retransmission of confirmed messages.

Cursors, the specialized data structures that manage Ack, are managed by the Broker and provided by BookKeeper’s Ledger. We’ll cover cursors in more detail in a later article.

Apache Pulsar provides flexible message consumption subscription types and message validation methods that support a wide variety of message and flow usage scenarios through a simple unified API.

Pulsar message Retention

After a message is acknowledged, Pulsar’s Broker updates the corresponding cursor. A message in a Topic can be deleted only after it has been ack by all subscriptions. Pulsar also allows messages to be kept longer by setting retention times, even if all subscriptions have confirmed consuming them. The following figure shows how to keep messages in a topic that has two subscriptions. Subscription A has consumed all messages before M6 and subscription B have consumed all messages before M10. This means that all messages prior to M6 (in the gray box) can be safely deleted. Subscription A is still not using messages between M6 and M9 and cannot delete them. If A topic has A message retention period configured, messages M0 through M5 will remain unchanged for the configured time period, even if A and B have confirmed consuming them.

Among message retention policies, Pulsar also supports message time to Live (TTL). If a message is not used by any consumer within the configured TTL period, the message is automatically marked as acknowledged. Message retention Period The difference between message TTL is that the message retention period applies to messages marked as acknowledged and set to deleted, while TTL applies to unack messages. TTL in Pulsar is illustrated in the illustration above. For example, if subscription B has no active consumer, the message M10 will automatically be marked as confirmed after the configured TTL period, even if no consumer actually reads the message.

Pulsar VS. Kafka

Through the above aspects, we summarize the differences between Pulsar and Kafka in the message model.

Modeling concepts

Kafka: producer-topic-consumer group-consumer;

Pulsar: producer-topic-subscription consumer.

Consumption patterns

Kafka: consume exclusively on a single partition in a Stream mode. There is no Queue mode.

Pulsar: Provides a unified messaging model and API. Stream mode — exclusive and failover subscription; Queue pattern — The way subscriptions are shared.

Message Confirmation (Ack)

Kafka: use Offset;

Pulsar: Managed using a dedicated Cursor. Cumulative validation works the same as Kafka; Provide single or optional confirmation.

Messages reserved

Kafka: Deletes messages based on the set retention period. It is possible that the message was not consumed and was deleted after expiration. TTL is not supported.

Pulsar: Messages are deleted only after they have been consumed by all subscriptions, without data loss. Retention periods can also be set to retain consumed data. Support TTL.

Comparison summary:

Apache Pulsar combines high-performance streams (what Apache Kafka is after) and flexible traditional queues (what RabbitMQ is after) into a unified messaging model and API. Pulsar uses a unified API to provide users with a system that supports streams and queues with the same high performance.

conclusion

In this blog post, we introduced Apache Pulsar’s messaging model, which unifies queuing and streaming traffic into one API. Applications can use this unified API for high-performance queuing and streaming without having to maintain two systems: RabbitMQ for queuing and Kafka for streaming. Hopefully this article has given you an idea of how the message model, message consumption, deletion, and retention work in Apache Pulsar; Understand the differences between Pulsar and Kafka message models. In a later article, we will introduce you to the architectural details of Apache Pulsar and the differences between Pulsar and Apache Kafka in terms of data distribution, replication, availability, and persistence.


If you are interested in Pulsar, you can participate in the Pulsar community in the following ways:

  • Pulsar Slack channel: apache-pulsar.slack.com/

    Register here: apache-pulsar.herokuapp.com/

  • Pulsar mailing list: pulsar.incubator.apache.org/contact

General information about the Pulsar Apache project, please visit the website: pulsar.incubator.apache.org/ @ apache_pulsar can also focus on Twitter account.