The original author is Ivan Kelly. We translate and publish the blog with the author’s permission.

When people talk about a stream processing engine (SPE), they often refer to becat-once (or exactly-once) assurance. Typically, a large data pipeline contains multiple components, and any component in the pipeline can fail, and the SPEs are usually just one of the small components. If the user expects the data pipe to provide a becque-once guarantee, other (non-SPE) components in the pipe need to provide the guarantee accordingly.

This article describes the types of guarantees Apache Pulsar can implement, and how to implement them.

An example word count application is shown below:

As shown in the figure above, data from a data source (such as the Twitter Firehose) is first pushed to the messaging system, which then transfers the data to the SPE and processes the data (word count). Finally, the data processing results are stored in the word count database, which can also be further used for analysis.

The article “becat-once in Apache Pulsar” discusses how the SPE (and especially Heron) can implement becat-once assurance. However, in the word count application shown above, both the messaging system and the data source (when pushing data to the messaging system) can fail.

www.splunk.com/en_us/blog/…

To ensure that the data pipeline can be effectively-once, the message system and data source need to be effectively-once guaranteed. Word count databases can also fail, further complicating the problem. However, this database is only used to store a copy of the data of the word count node. Therefore, even if the database fails, users can easily recover the database. This point will not be covered further in this article.

When a fault occurs in the message system or SPE, to provide becat-once guarantee, you need to specify a data point in time before the fault occurs. The SPE can reconnect to the message system at this point in time to read data. When reconnecting, the SPE needs to receive all the information it received before the failure. Message loss or duplicate failure should not occur. For most users, the message order should be the same.

Although the “same order” requirement is not an correctness requirement per se, consider the case where messages are in different order. To ensure that messages are not duplicated, the message receiver (that is, the SPE) must keep track of all received and processed messages. The longer the tracking time, the greater the storage requirements. Therefore, spEs are not suitable for long running systems. Therefore, the same order of messages is not a requirement, but it does facilitate the operation of the system.

No messages are lost or duplicated, and the messages are in the same order. This enables full order atomic broadcast (TOAB), commonly referred to as consensus in distributed systems. En.wikipedia.org/wiki/Consen…

In order for a messaging system to implement becquery-once assurance, it must implement consensus or rely on other external systems to help it achieve consensus.

The open source community provides many common systems for achieving consensus (www.consul.io/). However, these systems generally… K-v) interface, not suitable for storing message flows. Furthermore, key-value interfaces store data on a single log copy of the cluster. That is, when multiple streams are scaled out (horizontally) for the messaging system, the operation of these interfaces can become very problematic. Therefore, the most user-friendly system should provide both a log-like interface and the ability to scale out multiple streams.

Apache BookKeeper is one of the few systems that can offer both. BookKeeper achieves the consistency required by ZooKeeper and extends it horizontally so that a copy of the log can be stored.

For more information on how BookKeeper ensures the extensibility of full-order atomic broadcasts, see the article: www.splunk.com/en_us/blog/…

Apache Pulsar uses BookKeeper to store the message flow associated with each topic, so the becat-once guarantee can be achieved. BookKeeper provides Pulsar with a full-order atomic broadcast guarantee, so clients can reconnect to any point in time. Pulsar implements this connection through a cursor.

Pulsar uses the application that publishes the message to push the data to the messaging system, so it needs to implement the effiectively-once guarantee of the data. Pulsar implements message effiectively-once guarantee through message deduplication mechanism.

For more information, see the article becauti-once in Apache Pulsar. www.splunk.com/en_us/blog/…