This article is OSChina’s interview with Zhai Jia, co-founder and CTO of StreamNative and an Apache Pulsar PMC member. The interview focuses on the advantages of Apache Pulsar as a cloud-native tool for messaging and streaming data, how it compares with Kafka, and StreamNative’s background and direction. The original article was written by OSChina (Open Source China).

StreamNative, an open-source streaming data company, recently announced that it has completed a multi-million-dollar Pre-A round of funding and has officially joined the CNCF. Its founding team members are original core developers of the Apache Pulsar and Apache BookKeeper projects, and StreamNative is known as the company behind Pulsar, an open source messaging infrastructure. (In this article, Pulsar and BookKeeper refer to Apache Pulsar and Apache BookKeeper respectively.)

StreamNative is a commercial company built on Pulsar that provides cloud-native real-time messaging and streaming data processing technology. Pulsar was created inside Yahoo in 2012 as a unified messaging platform with a layered, segment-based (sharded) architecture. The upper layer, the Pulsar Broker, is a stateless serving layer; the lower layer, BookKeeper, provides high-performance, low-latency, strongly consistent I/O.

At the Pulsar Summit in June, Splunk and Yahoo presented their tests and analyses. At Splunk, Pulsar reduced costs by 1.5 to 2 times, latency by 5 to 50 times, and operating costs by 2 to 3 times. In the Yahoo deployment, Pulsar handled the same volume of traffic at half the actual hardware cost of Apache Kafka while maintaining higher data quality.

However, Kafka remains the better-known and more widely used open source distributed messaging system, and it is also an Apache top-level project. In what ways is Pulsar better than Kafka, technically and in terms of ecosystem? Does it have more room to grow? How do Pulsar and StreamNative coexist? Why did StreamNative get funded, and what does that mean for other companies built on open source projects?

We spoke with StreamNative co-founder and CTO Zhai Jia to learn more about StreamNative’s products and team, as well as the Pulsar and BookKeeper projects.

Guest introduction:

Zhai Jia is co-founder and CTO of StreamNative. Before founding StreamNative, he worked at EMC on the design and development of distributed systems, file systems, and streaming storage. He is currently a PMC member of two projects, Apache BookKeeper and Apache Pulsar.

Open source messaging system infrastructure, Pulsar

Zhai explained that Pulsar was created in 2012 to build a logically unified, large-cluster messaging platform inside Yahoo that would replace its other messaging systems. The messaging systems available at the time, including Kafka, could not meet Yahoo’s needs, such as multi-tenancy on large clusters, reliable I/O quality of service, million-scale topic counts, and cross-region replication, so Pulsar was born.

“At that time, it was called CMS (Cloud Message Service) inside Yahoo. The name shows that Pulsar was firmly oriented toward the cloud from birth,” said Zhai Jia. In 2015, Pulsar finished replacing the earlier systems inside Yahoo and was deployed at large scale, serving Yahoo Mail, Yahoo Finance, Yahoo Sports, Flickr, and the advertising platforms. Pulsar was donated to the Apache Software Foundation in June 2017 and graduated as an ASF top-level project in September 2018.

StreamNative currently positions Pulsar as a project in the “streaming data + cloud native” space. So how does Pulsar, as a cloud-oriented messaging infrastructure, deliver its “streaming data” and “cloud native” features? Here Zhai Jia introduces some of Pulsar’s technical features.

As a messaging infrastructure, Pulsar is bound to interact closely with both the storage layer and the computing layer.

On the storage side, Pulsar builds on Apache BookKeeper and extends the strengths of its own architecture. Because Pulsar splits each Topic into segments, old segments can be migrated naturally from BookKeeper to secondary storage. This tiered storage lets a Topic hold a practically unlimited amount of streaming data. In addition, we are adding support for a storage format in secondary storage that serves batch processing requests more efficiently, which truly meets the storage requirements of unified batch and stream processing.
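
The migration of older pieces of a Topic to secondary storage can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `Segment` class, the `offload_old_segments` helper, and the `keep_recent` policy are hypothetical names invented here, not Pulsar or BookKeeper APIs.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical model of one bounded piece of a topic."""
    segment_id: int
    entries: list
    sealed: bool = True              # sealed segments no longer accept writes
    location: str = "bookkeeper"     # or "secondary" once offloaded


def offload_old_segments(segments, keep_recent):
    """Mark all but the newest `keep_recent` sealed segments as offloaded.

    Returns the ids of the segments that were moved.
    """
    sealed = [s for s in segments if s.sealed]
    to_move = sealed[:-keep_recent] if keep_recent > 0 else sealed
    for segment in to_move:
        segment.location = "secondary"
    return [s.segment_id for s in to_move]


def read_topic(segments):
    """Readers see one continuous stream regardless of where segments live."""
    for segment in segments:
        yield from segment.entries
```

In this model the reader's view of the Topic does not change when segments move, which is the property that lets the amount of retained data grow with the capacity of the secondary store.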

For the computing layer, Pulsar’s approach is to embrace other open source projects. We provide Schema support in Pulsar so that other systems can understand the structure of the data stored in Pulsar. StreamNative’s open source pulsar-spark and pulsar-flink connectors are examples of closer integration with other big data engines. Pulsar SQL is also integrated directly with Presto to support queries over data in Pulsar.

Pulsar Functions, a lightweight function-based computing framework, is an innovation that brings the Serverless concept to messaging. A function is simple to write, and at runtime each message triggers one invocation of the function. This lightweight tool complements Spark and Flink and makes many common, simple computing tasks easy, such as data cleaning, routing, and enrichment.
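
A minimal sketch of the "one invocation per message" idea follows. The `process(input)` shape mirrors the plain-function style that Pulsar Functions accept in Python; everything else here (the `route` rule, the output names `errors` and `events`, and the `run_locally` driver that stands in for the Functions runtime) is a hypothetical illustration so the example runs without a Pulsar cluster.

```python
def process(input):
    """Data cleaning: called once per message, returns the cleaned message."""
    return input.strip().lower()


def route(message):
    """Routing: pick a hypothetical output name based on message content."""
    return "errors" if "error" in message else "events"


def run_locally(messages):
    """Stand-in for the runtime: invoke the function once for each message."""
    outputs = {}
    for raw in messages:
        cleaned = process(raw)
        outputs.setdefault(route(cleaned), []).append(cleaned)
    return outputs
```

Because each function handles a single message and keeps no state of its own, simple cleaning or routing logic of this kind does not need a full Spark or Flink job.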

By leveraging its strengths in the storage tier and integrating with more of the big data ecosystem, Pulsar offers a complete event-based data processing platform.

In the messaging space, Pulsar is the first open source project to put the separation of storage and computing into practice as a cloud-native architecture.

Beyond the layered architecture that separates storage from computing, Pulsar’s peer nodes, the resource pooling that comes with large-cluster management, and the elasticity that comes with high availability all fit the cloud-native concept well.

Streaming data can be divided into bounded and unbounded streams according to whether it has a definite start and end position. The data flowing into each Topic on the messaging platform is a natural representation of a stream of events. Pulsar’s Pub/Sub interface makes it easy for a computing platform to process a Topic as streaming data. At the same time, BookKeeper, the storage layer underneath Pulsar, divides a Topic into multiple bounded segments, which correspond to data blocks in HDFS. Segments in BookKeeper can be accessed directly, which is convenient for batch processing engines. With Pulsar, unifying batch and stream processing becomes easier.
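
The two views of one Topic described above, an unbounded stream for Pub/Sub consumers and a set of bounded pieces for batch engines, can be sketched as follows. The `SegmentedTopic` class and its rollover rule are assumptions made for illustration and do not reflect how Pulsar or BookKeeper decide when to close a piece.

```python
class SegmentedTopic:
    """Hypothetical topic that rolls over to a new segment when one fills up."""

    def __init__(self, max_entries_per_segment):
        self.max_entries = max_entries_per_segment
        self.segments = [[]]  # the last segment is the open one

    def append(self, entry):
        if len(self.segments[-1]) >= self.max_entries:
            self.segments.append([])  # previous segment is now bounded
        self.segments[-1].append(entry)

    def sealed_segments(self):
        """Batch view: bounded segments that can be read independently."""
        return self.segments[:-1]

    def stream_from(self, position):
        """Stream view: every entry from `position` onward, in order."""
        flat = [entry for segment in self.segments for entry in segment]
        return flat[position:]


def batch_sum_per_segment(topic):
    """A toy batch job: each sealed segment is processed on its own."""
    return [sum(segment) for segment in topic.sealed_segments()]
```

The batch job never touches the open segment, so it can run over the bounded history while stream consumers keep reading new entries from the same Topic.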

While computing engines such as Flink and Spark have good abstractions for unifying batch and stream processing, little work has been done at the data storage layer. StreamNative believes that Pulsar’s architecture is well suited to the storage requirements of unified batch and stream processing, which is Pulsar’s advantage in the data processing field.

StreamNative is adding support for a storage format in Pulsar’s secondary storage so that batch engines can access Pulsar data more efficiently. In this way, Pulsar provides a unified data storage layer: users only need to care about the data processing on top, not the low-level storage details.

Pulsar uses BookKeeper as the storage center

Pulsar uses BookKeeper as the storage center. BookKeeper provides a highly abstract API. Simply put, it is a distributed storage system that provides an unlimited number of write-ahead logs (WALs).

It has been more than five years since BookKeeper graduated to become an Apache top-level project. During this period, through heavy use and active contributions from Yahoo, Twitter, Salesforce, EMC and other companies, BookKeeper has become relatively stable and mature. StreamNative drives the growth of the BookKeeper community primarily through the growth of the Pulsar community.

It adds features to BookKeeper as Pulsar’s requirements call for them, and it invites BookKeeper users to take part in online and offline events. The interaction between the two communities also shows in the growing number of stars on the BookKeeper project on GitHub.

Pulsar versus Kafka

Perhaps the most important question for developers is: how good is Pulsar? Zhai Jia compares Pulsar with Kafka in three respects to illustrate Pulsar’s characteristics and advantages:

First, in terms of application scenarios, Pulsar gives users a unified messaging model. On the one hand, it meets the requirements of message queues such as RabbitMQ and ActiveMQ, which serve online transaction systems. On the other hand, it meets the high-throughput requirements of Kafka-like scenarios. This gives Pulsar a broader range of use cases and demand than Kafka.
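
One way to picture a unified model is that the same Topic can be consumed in different ways. Pulsar exposes this through subscription modes, including an exclusive mode (one consumer receives everything in order, as in streaming) and a shared mode (messages are spread across consumers, as in a work queue). The sketch below imitates those two behaviours with plain Python; the function names are hypothetical, and the simple round-robin dispatch is an assumption rather than a description of the broker's actual dispatcher.

```python
from itertools import cycle


def dispatch_exclusive(messages, consumer):
    """Streaming style: a single consumer gets every message, in order."""
    return {consumer: list(messages)}


def dispatch_shared(messages, consumers):
    """Queue style: messages are distributed round-robin across consumers."""
    assignment = {consumer: [] for consumer in consumers}
    for message, consumer in zip(messages, cycle(consumers)):
        assignment[consumer].append(message)
    return assignment
```

With the shared style, adding consumers raises processing capacity at the cost of global ordering; with the exclusive style, ordering is preserved for a single reader.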

Second, architecturally, Pulsar has the advantage of a cloud-native architecture that separates storage from computing. Because no data is stored at the Broker level, this architecture gives users higher availability and more flexible scaling and management, and it avoids data rebalancing and catch-up.
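
Why a broker that holds no data avoids rebalancing can be shown with a toy model in which brokers only own topics while the data sits in a separate storage map. The names `reassign_on_failure`, `ownership`, and `storage` are hypothetical, and the round-robin reassignment is an assumption, not Pulsar's load-balancing algorithm.

```python
def reassign_on_failure(ownership, failed_broker, live_brokers):
    """Return a new topic-to-broker map after `failed_broker` goes away.

    Only ownership changes; the stored data is never copied or moved.
    """
    new_ownership = dict(ownership)
    orphaned = sorted(t for t, b in ownership.items() if b == failed_broker)
    for index, topic in enumerate(orphaned):
        new_ownership[topic] = live_brokers[index % len(live_brokers)]
    return new_ownership


# Hypothetical cluster state: data lives in the storage layer, not in brokers.
storage = {"orders": [1, 2, 3], "clicks": [4, 5], "logs": [6]}
ownership = {"orders": "broker-1", "clicks": "broker-2", "logs": "broker-1"}
```

Recovery in this model is a metadata update, so its cost does not grow with the amount of data held by the topics the failed broker was serving.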

BookKeeper, Pulsar’s storage layer, was originally created as a consistency service for metadata, and it provides strong consistency guarantees while sustaining high bandwidth and low latency. Compared with Kafka, which relies on the file system for consistency, BookKeeper has a native consistency protocol, flushes data to disk in real time, and isolates reads and writes at the hardware level, all of which give Pulsar higher reliability and quality of data service.
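
As a rough illustration of quorum-based replication of the kind BookKeeper uses, the sketch below stripes each entry across a subset of storage nodes and treats a write as durable once enough of them have acknowledged it. The terms ensemble, write quorum, and ack quorum come from BookKeeper, but the helper functions and node names are hypothetical and this is a simplified model, not the actual protocol implementation.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Pick the nodes that store this entry, striping round-robin."""
    size = len(ensemble)
    return [ensemble[(entry_id + offset) % size] for offset in range(write_quorum)]


def is_acknowledged(acks, ack_quorum):
    """A write counts as durable once `ack_quorum` distinct nodes confirm it."""
    return len(set(acks)) >= ack_quorum
```

For example, with an ensemble of three nodes and a write quorum of two, consecutive entries land on different pairs of nodes, which spreads write load while every entry still has two copies.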

Third, in terms of community, Kafka had a first-mover advantage as the only option for streaming data at the time. Thanks to the strength of Pulsar’s architecture and functionality, attention and adoption have grown rapidly over the past two years, as has the number of Pulsar users and contributors in China and abroad. Two weeks ago, the number of global contributors passed 300.

At present, enterprises upgrading their messaging platforms focus on reducing costs and simplifying operations. The following features of Pulsar are in line with this trend:

  • Cloud native: reduces the cost of staffing, operation, and management
  • Large clusters: unified management and control of system resources
  • Unified platform: convenient data sharing and management

Zhai concluded that Pulsar’s strengths lie in its unique design and layered system architecture. With Pulsar’s architecture and features, users can deploy a unified cluster that, through resource pooling and multi-tenancy, serves every messaging scenario across the organization. This removes the burden of managing many small clusters, improves resource utilization, and makes it easier to share data within the cluster.
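
Multi-tenancy on a shared cluster is visible in Pulsar's topic naming, which takes the form `persistent://tenant/namespace/topic`. The sketch below parses such names and groups topics by tenant; the helper functions and the example tenants (`finance`, `ads`) are hypothetical and serve only to show how one cluster can host several teams side by side.

```python
def parse_topic(name):
    """Split a fully qualified topic name into its parts."""
    scheme, rest = name.split("://", 1)
    tenant, namespace, topic = rest.split("/", 2)
    return {"scheme": scheme, "tenant": tenant, "namespace": namespace, "topic": topic}


def topics_by_tenant(names):
    """Group topic names by tenant, as a shared cluster would see them."""
    grouped = {}
    for name in names:
        parts = parse_topic(name)
        grouped.setdefault(parts["tenant"], []).append(parts["topic"])
    return grouped
```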

With BookKeeper, Pulsar also offers better data quality, providing stronger consistency and durability while maintaining high bandwidth and low latency. In terms of operations and resource elasticity, a Pulsar cluster can be scaled out more quickly and conveniently, and replacing or updating nodes does not affect service reliability or availability. Pulsar also has considerable cost advantages over Kafka in production deployments. StreamNative recently published a detailed comparison of Pulsar and Kafka:

“Pulsar vs Kafka – Part 1 – A More Accurate Perspective on Performance, Architecture, and Features”

“Pulsar vs Kafka – Part 2 – Adoption, Use Cases, Differentiators, and Community”

StreamNative based on Pulsar

An understanding of Pulsar makes it easier to understand StreamNative.

While developing and maintaining Pulsar, StreamNative mainly provides Pulsar-based cloud hosting, operations, and technical support services. Note that Pulsar is a top-level open source project of the Apache Software Foundation and is owned by that neutral foundation, which is the basis of trust for StreamNative’s work in the Pulsar community.

StreamNative was founded in 2019. The company’s developers were early contributors to Apache Pulsar, and many members are PMC members or committers of Apache Pulsar. Its co-founder and CEO, Sijie Guo, is the prototype designer and lead developer of Apache Pulsar.

According to Zhai, StreamNative’s two main focus areas are the Pulsar community and the cloud, which investors are very optimistic about.

In terms of products, StreamNative offers StreamNative Cloud, which provides enterprises with a fully managed “Apache Pulsar as a service” in the cloud, according to Zhai Jia. “Engineers familiar with Pulsar will be impressed by its elastic architecture, with storage separated from computing and data organized in layered segments, which is one reason why StreamNative often says Pulsar has a cloud-native architecture.” StreamNative’s services suit customers with higher requirements for the quality, control, and maintenance of Pulsar cluster operations.

The StreamNative team is also working on improving the Pulsar community.

Building the Pulsar community centers on the product and its interaction with users: contributing to and improving Pulsar’s features, enriching the ecosystem around Pulsar, helping release and maintain Pulsar versions, communicating with community users and helping remove the obstacles they meet when putting Pulsar into production, and organizing and taking part in promotional activities such as Pulsar meetups. The main goals of this work are to improve Pulsar’s documentation, lower the barrier to entry, enrich Pulsar’s integration with other systems, and lay the groundwork for new users.

“In the long run, it is more important to broaden the community’s use cases, attract more users to participate, and build an active and sustainable community.” Zhai Jia revealed that, in addition to Pulsar’s own advantages, recognition from the community and customers was one of the reasons the Pre-A round succeeded. Pulsar is now widely used at leading Internet companies.

In addition, StreamNative is improving its Kubernetes-based cloud platform. A preview version of the platform, StreamNative Cloud, has been released on Google Cloud, and a preview version in China is expected at the end of this year.

Open source projects and commercial companies

Zhai spent most of the interview introducing Pulsar. He argues that open source projects and commercial companies are mutually beneficial.

Behind every successful open source project is a commercial company that continuously provides core support to the community and its users. Community users are attracted by the architecture and functionality of the open source product and join the community. In serving the community, the company keeps receiving feedback, innovates, and improves the product’s overall performance across production environments. The company then reuses its mature projects and accumulated experience to serve the parts of the community that need them, forming a virtuous circle. This is the relationship between StreamNative, led by Pulsar’s core team, and the Pulsar community.

Speaking of business opportunities, Zhai Jia, as a technologist, said that he has always held on to the dream that “technology changes the world”. “Members of StreamNative’s founding team have witnessed the construction and operation of Pulsar storage clusters at the 3000+ scale. Having seen and worked with Pulsar and BookKeeper over a long period at Yahoo and Twitter, we are convinced of Pulsar’s advantages in architecture and functionality, and of how well it matches the cloud-native direction. At the same time, we’ve seen a lot of recognition from developers for Pulsar’s architecture and product, and we’ve seen a lot of pain points from users.”

In addition, the model for commercializing open source has matured over the past two years, and commercial companies have risen behind open source projects such as Spark, ES, MongoDB, and TiDB. StreamNative believes that Pulsar and StreamNative have the same opportunity.

However, with StreamNative investing so much time and effort in Pulsar and its community ecosystem, will Pulsar become tied to a commercial company, weakening the project’s open source, collaborative nature?

Zhai Jia believes that this kind of tie promotes open source collaboration rather than weakening it.

First, Pulsar is entirely open source, and StreamNative and the community use the same code. StreamNative’s investment in the community builds trust and growth; community feedback leads to mature, innovative iterations of Pulsar; and eventually more users will trust Pulsar and StreamNative.

Second, customers in finance, securities, retail, IoT and other sectors are adopting Pulsar. As StreamNative supports these customers, Pulsar matures across different scenarios. By serving its customers, StreamNative comes to understand the needs of each vertical and can keep reaching and satisfying more customers in the same domain.

Like Kafka, Pulsar is a project of the Apache Software Foundation. LinkedIn remained a user of Kafka after donating it, and Yahoo is now a user of Pulsar. For LinkedIn and Yahoo, the most important thing is to keep using the open source project (Kafka or Pulsar) to meet their own online business needs. In addition, open source makes the project more mature and robust, which benefits the original company.

To sum up, the Pulsar project itself will keep developing in the cloud-native direction. StreamNative is optimistic about this direction and will devote its full energy and time to Pulsar. On the one hand, this helps expand the Pulsar ecosystem, which in turn feeds StreamNative’s own commercial resources. On the other hand, StreamNative can use its expertise in Pulsar and BookKeeper technology to differentiate itself from other Pulsar service providers. These may be the reasons investors are bullish on StreamNative now, and the reasons more developers and users may choose StreamNative in the future.

“Apache”, “Apache Pulsar”, “Apache BookKeeper”, “Pulsar”, and “BookKeeper” are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the assets of their respective owners.