LinkedIn has open sourced Brooklin, a distributed service for streaming data in near real time and at scale. Brooklin has been running in production at LinkedIn since 2016, powering thousands of data streams and processing over two trillion messages per day.

Why Brooklin

Moving large volumes of data quickly and reliably is no longer the only problem LinkedIn needs to solve; the growing diversity of data stores and messaging systems is just as pressing as the growth of the data itself. LinkedIn built Brooklin to address both dimensions of scalability: the system must scale with the volume of data, and it must also scale with the variety of systems it connects.

What is Brooklin

Brooklin is a distributed system for streaming data between multiple disparate data stores and messaging systems with high reliability. It exposes a set of abstractions that let Brooklin be extended to consume data from, and produce data to, new systems by writing new Brooklin consumers and producers.
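To make this concrete, here is a minimal sketch of what such extension points could look like. These interfaces and the StreamEvent type are hypothetical illustrations of the pattern, not Brooklin's actual API:

```java
// Illustrative only: hypothetical stand-ins for Brooklin-style extension
// points, not the project's real interfaces.

/** A single event flowing through the service. */
record StreamEvent(String key, byte[] value) {}

/** Source-side extension point: consumes events from an external system. */
interface SourceConnector {
    void start();        // begin consuming from the source
    StreamEvent poll();  // next available event, or null if none is ready
    void stop();         // release source-side resources
}

/** Destination-side extension point: produces events to an external system. */
interface DestinationProducer {
    void send(StreamEvent event);  // hand one event to the destination
    void flush();                  // block until buffered events are confirmed
}
```

Supporting a new source or destination then amounts to implementing the corresponding interface for that system.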



Brooklin use cases

Brooklin has two broad categories of use cases: stream bridges and change data capture (CDC).

Stream bridge

Data can be spread across different environments (public cloud and company data centers), geographic locations, or deployment groups. Requirements such as access mechanisms, serialization formats, compliance, and security often add further complexity. Brooklin can serve as a bridge that moves data between these environments: for example, between different cloud services, between different clusters within a data center, or even between data centers.



Figure 2. A hypothetical scenario: a Brooklin cluster is used as a stream bridge to transfer data from Kinesis to Kafka and from Kafka to Azure Event Hubs



Because Brooklin is a dedicated service for streaming data between different environments, all of that complexity can be managed in a single service, and developers can focus on processing data rather than moving it. In addition, this centralized, managed, extensible framework lets organizations enforce policies and simplify data governance. For example, Brooklin can be configured to enforce company-wide policies, such as requiring all data streams to be in JSON format or requiring all data in a stream to be encrypted.
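As a sketch of what such a policy hook might look like, the check below rejects non-JSON payloads. The class is hypothetical, and a real deployment would validate against an actual schema rather than inspecting braces:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical policy check a centralized bridge could apply to every event. */
final class PolicyEnforcer {
    /** Reject payloads that are not JSON objects (a crude stand-in for schema validation). */
    static void requireJson(byte[] payload) {
        String body = new String(payload, StandardCharsets.UTF_8).strip();
        if (!(body.startsWith("{") && body.endsWith("}"))) {
            throw new IllegalArgumentException("policy violation: payload is not JSON");
        }
    }
}
```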

Kafka mirroring

Before Brooklin, LinkedIn moved data between Kafka clusters with Kafka MirrorMaker (KMM), which struggled at large scale. Because Brooklin is designed as a general-purpose bridge for streaming data, it can mirror Kafka data at very large scale.

One example of using Brooklin as a stream bridge at LinkedIn is mirroring Kafka data between clusters and across data centers.



Figure 3. A hypothetical scenario: with Brooklin, users can easily access all of their data in one place. A single Brooklin cluster per data center can handle multiple source/destination pairs.

Brooklin's Kafka mirroring solution has been battle-tested at LinkedIn and has completely replaced KMM. It addresses several of KMM's pain points and brings new features, discussed below.

Feature 1: Multitenancy

In the KMM deployment model, mirroring happens only between a fixed pair of Kafka clusters. This requires one KMM cluster per data pipeline, which leads to dozens or even hundreds of KMM clusters that are very difficult to manage. Brooklin, by contrast, is designed to serve many data pipelines at the same time, so a single Brooklin cluster suffices. Comparing Figure 3 with Figure 4 makes the difference clear.



Figure 4. A hypothetical scenario: KMM is used to mirror data across data centers

Feature 2: Dynamic provisioning and management

With Brooklin, creating a new data pipeline (called a datastream) or modifying an existing one takes a simple HTTP call to a REST endpoint. For the Kafka mirroring use case, this endpoint makes it easy to create a new mirroring pipeline or modify the whitelist of an existing one, with no static configuration changes or redeployment.
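For illustration, the request below sketches what creating a Kafka mirroring datastream over HTTP might look like. The host, port, path, and JSON fields are assumptions made for this example; the real contract is defined by the Brooklin REST API of your deployment:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateDatastream {
    public static void main(String[] args) throws Exception {
        // Hypothetical payload: mirror every source topic matching "metrics.*".
        String body = """
            {
              "name": "kafka-mirror-metrics",
              "connectorName": "kafkaMirroringConnector",
              "source": { "connectionString": "kafka://source-cluster:9092/^metrics.*$" }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://brooklin.example.com:32311/datastream"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```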

Although many mirroring pipelines can coexist in the same cluster, Brooklin can control and configure each one individually: you can edit a pipeline's mirroring whitelist or assign it more resources without affecting any other pipeline. Brooklin also allows individual pipelines to be paused and resumed on demand, which is useful when temporarily operating on or modifying a pipeline. For Kafka mirroring, Brooklin supports pausing or resuming an entire pipeline, a single topic in the whitelist, or even a single topic partition.
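A pause or resume follows the same pattern. In the hypothetical call below, the action and its scope (topic and partitions) are passed as query parameters; the real endpoint and parameter names depend on your Brooklin version:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PausePartitions {
    public static void main(String[] args) throws Exception {
        // Hypothetical: pause partitions 0 and 3 of one mirrored topic only.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://brooklin.example.com:32311/datastream/kafka-mirror-metrics"
                + "?action=pause&topic=metrics.requests&partitions=0,3"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```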

Feature 3: Diagnostics

Brooklin also exposes a diagnostics REST endpoint that allows on-demand queries of a datastream's state. This API makes it easy to inspect a pipeline's internal state, including lag or errors on any individual topic partition. Because the diagnostics endpoint aggregates findings from the whole Brooklin cluster, it is very useful for quickly diagnosing problems with a particular partition without trawling through log files.
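A diagnostics query might then look like the sketch below; again, the path and query parameters are illustrative assumptions, not the documented API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class QueryDiagnostics {
    public static void main(String[] args) throws Exception {
        // Hypothetical: fetch per-partition state (lag, errors) for one datastream.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://brooklin.example.com:32311/diag"
                + "?datastream=kafka-mirror-metrics&detail=partition"))
            .GET()
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```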

Special features

Because Brooklin was built to replace KMM, it is optimized for stability and operability, and it includes several features specific to Kafka mirroring.

  1. Fault isolation: Most importantly, we strive for better fault isolation, so that errors in mirroring a particular partition or topic do not take down an entire pipeline or cluster, as they would with KMM. Brooklin detects errors at the partition level and automatically pauses mirroring for the affected partitions. These auto-paused partitions can be resumed automatically after a configurable period, which removes the need for manual intervention and is especially useful for transient errors. Meanwhile, the processing of other partitions and pipelines is unaffected.
  2. Flushless produce mode: To improve mirroring latency and throughput, Brooklin's Kafka mirroring can also run in flushless produce mode, where Kafka consumption progress is tracked at the partition level and checkpoints are made per partition rather than per pipeline. This lets Brooklin avoid expensive Kafka producer flush calls, which are synchronous, blocking operations that can stall an entire pipeline for minutes. A minimal sketch of this per-partition checkpointing appears after this list.
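The sketch below illustrates the idea behind per-partition checkpointing, assuming the Apache Kafka Java client. The class is a simplified illustration rather than Brooklin's implementation; a production version would also have to handle gaps left by out-of-order acknowledgements:

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Simplified illustration of flushless, per-partition checkpointing. */
public class FlushlessMirror {
    // Highest source offset acknowledged by the destination, per source partition.
    private final Map<Integer, Long> safeOffsets = new ConcurrentHashMap<>();
    private final KafkaProducer<byte[], byte[]> producer;

    public FlushlessMirror(Properties producerConfig) {
        this.producer = new KafkaProducer<>(producerConfig);
    }

    /** Mirror one record; the async callback advances the checkpoint on success. */
    public void mirror(String destTopic, int sourcePartition, long sourceOffset,
                       byte[] key, byte[] value) {
        producer.send(new ProducerRecord<>(destTopic, key, value), (metadata, error) -> {
            if (error == null) {
                safeOffsets.merge(sourcePartition, sourceOffset, Math::max);
            }
            // On error, partition-level handling would pause just this partition.
        });
    }

    /** Offset safe to commit for a source partition, with no producer.flush() needed. */
    public long committableOffset(int sourcePartition) {
        return safeOffsets.getOrDefault(sourcePartition, -1L);
    }
}
```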

By switching from KMM to Brooklin, LinkedIn reduced the number of mirroring clusters from several hundred to about a dozen, while also speeding up the rate at which new features and improvements can be shipped.


Change data capture (CDC)

Brooklin's second major use case is change data capture (CDC). The goal here is to stream database updates as a low-latency change stream. For example, much of LinkedIn's source-of-truth data (such as jobs, connections, and profile information) lives in various databases, and some applications want to know when a new job is posted, a new professional connection is made, or a member's profile is updated. Rather than have every interested application run expensive queries against the online database to detect such changes, Brooklin streams the database updates to them in near real time. One of the biggest advantages of using Brooklin for change data capture is better resource isolation between the applications and the online stores: applications can scale independently of the database and cannot risk bringing it down. At LinkedIn, we have built Brooklin change data capture solutions for Oracle, Espresso, and MySQL; moreover, Brooklin's extensible model makes it easy to write new connectors that add CDC support for any database source.
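To give a feel for what applications consume, here is an illustrative shape for a change-capture event; the field names are hypothetical, not Brooklin's actual event schema:

```java
/** Hypothetical change-capture event as an application might see it. */
record ChangeEvent(
    String database,    // e.g. "Espresso"
    String table,       // e.g. "MemberProfile"
    String operation,   // "INSERT", "UPDATE", or "DELETE"
    long sequence,      // source commit sequence number, preserving change order
    String beforeJson,  // row image before the change (null for inserts)
    String afterJson    // row image after the change (null for deletes)
) {}
```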



Figure 5

Feature 1: Bootstrap support

Sometimes an application needs a complete snapshot of the data store before it starts consuming incremental updates: for example, when the application starts for the first time, or when it must reprocess the entire dataset because its processing logic changed. Brooklin's extensible connector model supports this use case.
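One common way to express this contract, sketched below with hypothetical names, is to have the snapshot report the change-log position at which it was taken, so that incremental streaming can resume from exactly that point with no update missed or double-applied:

```java
import java.util.function.Consumer;

/** Hypothetical bootstrap-then-incremental contract for a CDC connector. */
interface ChangeSource {
    /** Scan a full snapshot of the store; returns the change-log position of the snapshot. */
    long snapshot(Consumer<String> rowHandler);

    /** Stream incremental changes starting from the given change-log position. */
    void streamFrom(long position, Consumer<String> changeHandler);
}
```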

Feature 2: Transaction support

Many databases support transactions, and for these sources Brooklin connectors ensure that transaction boundaries are preserved.
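A minimal sketch of how boundaries could be preserved, with hypothetical names, is to buffer a transaction's change events and emit them only once the source transaction commits:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical buffer that releases change events only at commit time. */
class TxnBuffer {
    private final List<String> pending = new ArrayList<>();

    void onChange(String event) {
        pending.add(event);  // hold the event until the transaction resolves
    }

    List<String> onCommit() {
        List<String> txn = List.copyOf(pending);  // emit the transaction as one unit
        pending.clear();
        return txn;
    }

    void onRollback() {
        pending.clear();  // rolled-back changes are never emitted downstream
    }
}
```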

For more information

For more information about Brooklin, including an overview of its architecture and capabilities, check out our previous engineering blog post.

Brooklin's first open source release ships with Kafka mirroring, which you can test-drive using the simple instructions and scripts we provide. We are working hard to add support for more sources and destinations to the project, so stay tuned!

In the future

Brooklin has been running successfully in LinkedIn's production environment since October 2016. It has replaced Databus as the change capture solution for Espresso and Oracle sources, and it is our stream bridge for moving data between Azure, AWS, and LinkedIn, including mirroring trillions of messages per day across our many Kafka clusters.

We will continue to build connectors to support additional sources (MySQL, Cosmos DB, Azure SQL) and destinations (Azure Blob storage, Kinesis, Cosmos DB, Couchbase). We also plan to add optimizations to Brooklin, such as automatic scaling based on traffic demand, the ability to skip decompression and recompression of messages in mirroring pipelines to improve throughput, and additional read and write optimizations.

Project repository

GitHub

Conclusion

  1. Brooklin is a distributed data streaming service that connects disparate data stores and messaging systems.
  2. Brooklin implements Kafka mirroring and offers multitenancy, dynamic provisioning and management, diagnostics, fault isolation, flushless produce mode, and more.
  3. Brooklin also provides CDC functionality, but the current release supports only Espresso and Oracle sources; connectors for MySQL, Azure SQL, and other databases are not yet available.