Translator's preface

Original paper: http://notes.stephenholiday.com/Kafka.pdf

TL;DR:

Compared with other messaging systems such as JMS implementations, Kafka gives up a good deal of functionality in exchange for performance.

The paper describes Kafka's design trade-offs and the many techniques it uses to improve performance.

Abstract

Log processing has become an important part of the data pipeline for consumer Internet companies.

We’ll start by introducing Kafka, a distributed messaging system we developed for collecting and delivering large volumes of logging data with low latency.

Kafka combines the ideas of existing log aggregators and messaging systems for consuming both offline and online messages.

We did a lot of unconventional but practical design in Kafka to make our system efficient and scalable.

Our experimental results show that Kafka has superior performance compared to two popular messaging systems.

We’ve been using Kafka in production for some time now, processing hundreds of gigabytes of new data every day.

1. Introduction

Any large Internet company generates a lot of “log” data.

These data typically include:

  • User activity events, including logins, page views, clicks, “likes”, shares, comments, and search queries

  • Operational metrics, such as service call stacks, call latency, and errors, and system metrics such as CPU, memory, network, or disk utilization.

Log data has long been an integral part of analytics to track user engagement, system utilization, and other metrics.

However, recent trends in Internet usage have made activity data part of the production data pipeline, used directly in web site features.

These uses include:

  • Search relevance
  • Recommendations generated by popular or co-occurring items in the activity stream
  • Advertising targeting and reporting
  • Security applications that protect against abusive behavior such as spam or unauthorized data scraping
  • News feed features that aggregate a user's status updates or actions for their "friends" to read.

This production, real-time use of log data presents new challenges for data systems because the volume of data is orders of magnitude larger than “real” data.

For example, search, recommendations, and advertising often require computing granular click-through rates, which generate not only a log record for each user click but also records for the dozens of unclicked items on each page.

China Mobile collects 5-8 terabytes of phone records per day, while Facebook collects nearly 6 terabytes of various user activity events per day.

Many of the early systems that handled this kind of data relied on physically collecting log files from production servers for analysis.

In recent years, several specialized distributed log aggregators have been released, including Facebook’s Scribe, Yahoo’s Data Highway, and Cloudera’s Flume.

These systems are primarily designed to collect log data and load log data into a data warehouse or Hadoop for offline consumption.

At LinkedIn, we found that we needed to support most of the above real-time applications with a delay of no more than a few seconds, in addition to traditional offline analytics.

We built a new messaging system for log processing, called Kafka, that combines the advantages of traditional log aggregators and messaging systems.

On the one hand, Kafka is distributed and scalable, and provides high throughput.

Kafka, on the other hand, provides a messaging system-like API that allows applications to consume logging events in real time.

Kafka is open source and has been used successfully in production at LinkedIn for more than six months.

It greatly simplifies our infrastructure because we can use a single piece of software to consume various types of log data both online and offline.

The rest of this article is arranged as follows.

  • In Section 2, we review traditional messaging systems and log aggregators.
  • In Section 3, we describe Kafka's architecture and its key design principles.
  • In Section 4, we describe how we deploy Kafka at LinkedIn.
  • In Section 5, we present performance results for Kafka.
  • In Section 6, we discuss future work and conclude.

2. Related work

Traditional enterprise messaging systems have been around for a long time and often play a crucial role in the event bus that handles asynchronous data flows.

However, for several reasons, they tend not to be a good fit for log processing.

First, there is a mismatch between the features provided by enterprise systems and what log collection needs. Those systems tend to focus on offering rich delivery guarantees.

For example, IBM Websphere MQ has transactional support, allowing an application to atomically insert messages into multiple queues.

The JMS specification allows each individual message to be acknowledged after it is consumed, potentially out of order. (Translator's note: I am not familiar with JMS; I read this as out-of-order acknowledgment, which relates to idempotent consumption.)

Such delivery guarantees are overkill for collecting log data. Losing the occasional page-view event is certainly not the end of the world.

Unneeded functionality tends to add complexity to the APIs and underlying implementations of these systems.

Second, many systems do not treat throughput as their primary design constraint. For example, JMS has no API that allows a producer to explicitly batch multiple messages into a single request. This means each message requires a full TCP/IP round trip, which is not feasible for the throughput requirements of our domain.

Third, those systems are weak in distributed support. There is no easy way to partition and store messages across multiple machines.

Finally, many messaging systems assume that messages will be consumed in near real-time, and the amount of unconsumed messages is always quite small.

As a result, their performance degrades significantly if messages are allowed to accumulate, for example when offline consumers such as data warehouses consume from the messaging system in large periodic batches rather than continuously.

Over the past few years, several specialized log aggregators have been built.

For example, Facebook uses a system called Scribe, where each front end machine can send log data to a set of Scribe machines over the network.

Each Scribe machine aggregates log entries and periodically dumps them to HDFS or NFS devices.

Yahoo's Data Highway project takes a similar approach to data delivery: a group of machines aggregates events from clients, saves them as minute-level files, and then adds those files to HDFS.

Flume is a relatively new log aggregator developed by Cloudera. It supports extensible "pipes" and "sinks", which makes streaming log data very flexible. It also has better integrated distributed support.

However, most of these systems are built to consume log data offline, often exposing implementation details such as “files saved by the minute” to consumers unnecessarily.

In addition, most of them use a “push” model, in which brokers forward data to consumers.

At LinkedIn, we find that the "pull" model works better for our applications, because each consumer can retrieve messages at the maximum rate it can sustain and avoid being flooded by messages pushed faster than it can handle. The pull model also makes it easy for a consumer to rewind and re-consume; this benefit is discussed in detail at the end of Section 3.2.

Recently, Yahoo! Research developed a new distributed pub/sub system called HedWig. HedWig is highly scalable and available, and provides strong durability guarantees. However, it is mainly intended for storing the commit log of a data store.

3. Kafka architecture and design principles

Because of the limitations of existing messaging systems, we developed a new messaging-based log aggregator, Kafka.

Let’s start by introducing the basic concepts in Kafka.

A topic defines a particular type of message flow.

A producer can publish messages to a topic. The published messages are then stored on a set of servers called brokers.

A consumer can subscribe to one or more topics from the Broker and consume the subscribed messages by extracting data from the Broker.

Conceptually, the definition of a message is deliberately simple, and we have tried to make the Kafka API equally simple. Rather than showing the exact API, we present some sample code to show how the API is used.

Sample code for the producer is shown below. A message is defined to contain just a payload of bytes. Users can choose their preferred serialization method to encode a message. For efficiency, a producer can send a set of messages in a single publish request.

Sample producer code:

    producer = new Producer(...);
    message = new Message("test message str".getBytes());
    set = new MessageSet(message);
    producer.send("topic1", set);

To subscribe to a topic, a consumer first creates one or more message flows (understood as partitions) for that topic.

Messages published to this topic are evenly distributed among these submessage flows (partitions).

The details of how Kafka allocates messages are described in section 3.2 below.

Each message flow provides an iterator interface on a continuously generated message flow.

The consumer iterates over each message in the message flow and processes the content of the message.

Unlike traditional iterators, message flow iterators never terminate.

If there are currently no more messages to consume, the iterator blocks until a new message is published to the topic.

We support both the point-to-point delivery model, in which multiple consumers jointly consume a single copy of all messages in a topic, and the publish/subscribe model, in which multiple consumers each retrieve their own copy of a topic.

Sample consumer code:

    streams[] = Consumer.createMessageStreams("topic1", 1);
    for (message : streams[0]) {
        bytes = message.payload();
        // do something with the bytes
    }

The overall architecture of Kafka is shown in Figure 1.

Because Kafka is distributed, a Kafka cluster usually consists of multiple brokers.

To balance the load, a topic is divided into partitions, with each Broker storing one or more partitions.

Multiple producers and consumers can publish and consume messages simultaneously.

In Section 3.1, we describe the layout of a single partition on a broker, along with some design choices we made to make accessing a partition more efficient.

In Section 3.2, we describe how producers and consumers interact with multiple brokers in a distributed setup.

In Section 3.3, we’ll talk about Kafka’s delivery guarantee.

3.1 Performance of a Single Partition

We made some design decisions in Kafka to make the system more efficient.

1. Simple storage: Kafka has a very simple storage layout.

Each partition of a topic corresponds to a logical log.

Physically, a log is implemented as a set of segment files of roughly the same size (for example, 1GB).

Every time a producer publishes a message to a partition, the Broker simply appends the message to the last segment file.

For better performance, we only flush segment files to disk after a configurable number of messages have been published or a certain amount of time has elapsed.

A message is not exposed to consumers until it has been flushed.

Unlike typical messaging systems, messages stored in Kafka do not have explicit message ids.

Instead, each message is addressed by its logical offset in the log.

This avoids the overhead of maintaining index structures used to aid queries that map message ids to the actual message location.

Note that the message ids we mentioned are increasing but not consecutive. To compute the id of the next message, we add the length of the current message to its id.

From now on, we will use message ids and offsets interchangeably.
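
A purely illustrative sketch of this addressing scheme (the names below are made up, not Kafka's code): since a message id is its byte offset in the log, the id of the next message is the current id plus the current message's length on disk, so ids increase but are not consecutive.

    // Illustrative only: message ids are byte offsets in the partition log.
    long nextMessageId(long currentMessageId, int currentMessageLengthInBytes) {
        return currentMessageId + currentMessageLengthInBytes;
    }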

Consumers always sequentially consume messages from a particular partition.

If the consumer confirms a particular message offset, it means that the consumer has received all messages in the partition up to that offset.

In practice, the consumer issues asynchronous pull requests to the broker so that a buffer of data is always ready for the application to consume.

Each pull request contains the offset of the message from which consumption should begin and an acceptable number of bytes to fetch.

Each Broker keeps a sorted list of offsets in memory, including the offsets of the first message in each segment file. The Broker searches the list of offsets to locate the segment file in which the requested message resides and sends the data back to the consumer.

When a consumer receives a message, it calculates the offset of the next message to consume and uses it in the next pull request.
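
A minimal sketch of that segment lookup, assuming the broker keeps the first offset of each segment file in a sorted in-memory structure; the class and method names are illustrative, not Kafka's actual implementation.

    import java.util.TreeMap;

    // Maps the first offset of each segment file to the file that starts there.
    class SegmentIndex {
        private final TreeMap<Long, String> firstOffsetToFile = new TreeMap<>();

        void addSegment(long firstOffset, String fileName) {
            firstOffsetToFile.put(firstOffset, fileName);
        }

        // Find the segment file containing the message at the requested offset:
        // the greatest segment start offset that is <= the requested offset.
        String locate(long requestedOffset) {
            return firstOffsetToFile.floorEntry(requestedOffset).getValue();
        }
    }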

The layout of Kafka logs and in-memory indexes is shown in Figure 2. Each box shows the offset of a message.


2. Efficient transfer: We are very careful when transferring data in Kafka.

Earlier, we showed that a producer can submit a set of messages in a single send request.

Although the consumer API iterates one message at a time, in practice each pull request from a consumer also retrieves multiple messages up to a certain size, typically a few hundred kilobytes.

Another unconventional choice we made was to avoid caching messages in memory at the Kafka level.

Instead, we rely on the page cache of the underlying file system.

The main benefit of doing this is to avoid double buffering, where messages are only cached in the page cache.

This has the added benefit of preserving a warm cache even when the broker process is restarted.

Since Kafka does not cache messages in process at all, it has very little overhead from garbage-collecting memory, which makes an efficient implementation in a VM-based language feasible.

Finally, since both the producer and the consumer access segment files sequentially, with the consumer often lagging slightly behind the producer, normal operating system caching heuristics are very effective (specifically write-through caching and read-ahead).

We found that both production and consumption have consistent performance that scales linearly with data size, up to many terabytes of data. (Translator's note: I did not fully follow this claim.)

In addition, we have optimized consumer web access.

Kafka is a multi-consumer system, where a message can be consumed multiple times by different consumer applications.

A typical way to send bytes from a local file to a remote socket involves the following steps.

  1. Read data from the storage medium into the page cache in the operating system
  2. Copy the data from the page cache to the application buffer
  3. Copy the application buffer to another kernel buffer
  4. Send the kernel buffer to the Socket.

This involves four data copies and two system calls.

On Linux and other Unix operating systems, there is a sendfile API that transfers bytes directly from a file channel to a socket channel. This avoids the two copies and one system call introduced in steps (2) and (3).

Kafka exploits the sendfile API to efficiently deliver bytes in a log segment file from a broker to a consumer.
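
In Java, the sendfile mechanism is exposed through FileChannel.transferTo. The sketch below only illustrates the idea described above; it is not Kafka's broker code, and the names are made up.

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    class ZeroCopySend {
        // Send byteCount bytes of a log segment file, starting at startPosition,
        // directly to a consumer socket without copying through user space.
        static void sendChunk(String segmentPath, long startPosition, long byteCount,
                              SocketChannel consumerSocket) throws IOException {
            try (FileChannel segment =
                     FileChannel.open(Paths.get(segmentPath), StandardOpenOption.READ)) {
                long sent = 0;
                while (sent < byteCount) {
                    // transferTo may send fewer bytes than requested, so loop
                    sent += segment.transferTo(startPosition + sent, byteCount - sent, consumerSocket);
                }
            }
        }
    }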

3. Stateless Broker: Unlike most other messaging systems, in Kafka the information about how much each consumer has consumed is not maintained by the broker but by the consumer itself. This design greatly reduces complexity and reduces the overhead on the broker.

However, this makes deleting the message tricky because the Broker does not know if all users have consumed the message.

Kafka solves this problem by using a simple time-based SLA retention policy.

If a message remains in the broker for more than a certain amount of time, typically seven days, it is automatically deleted.

This scheme works well in practice. Most consumers, including offline ones, finish consuming daily, hourly, or in real time. Since Kafka's performance does not degrade as the amount of data grows, this long retention is feasible.
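
A minimal sketch of such a time-based retention sweep, assuming one directory of segment files per partition; the layout and names are assumptions for illustration, not the broker's actual code.

    import java.io.File;
    import java.util.concurrent.TimeUnit;

    class RetentionSweeper {
        private static final long RETENTION_MS = TimeUnit.DAYS.toMillis(7);

        // Periodically delete segment files older than the retention window.
        static void sweep(File partitionDir) {
            long cutoff = System.currentTimeMillis() - RETENTION_MS;
            File[] segments = partitionDir.listFiles();
            if (segments == null) return;
            for (File segment : segments) {
                if (segment.lastModified() < cutoff) {
                    segment.delete(); // messages older than the SLA are dropped
                }
            }
        }
    }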

This design has an important side effect.

A consumer can deliberately backtrack to an old offset and re-consume data.

This violates the general rules of queues, but has proved to be an essential feature for many consumers.

For example, when there is an error in a consumer's application logic, the application can replay the affected messages after the error is fixed. This is especially important for ETL data loads into our data warehouse or Hadoop system.

As another example, consumed data may only be flushed to a persistent store periodically (for example, by a full-text indexer).

If the consumer crashes, unflushed data is lost. In this case, the consumer can check the minimum offset of the unflushed message and re-consume from that offset on reboot.
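
The paragraph above amounts to a checkpoint-and-rewind pattern on the consumer side. The sketch below is a hypothetical illustration of that pattern, not a Kafka API.

    // The consumer checkpoints the smallest offset among messages not yet flushed
    // to its own persistent store; after a crash it resumes from that checkpoint,
    // accepting that some messages may be processed twice.
    class RewindingConsumer {
        interface CheckpointStore {
            void save(long offset);
            long load();
        }

        private final CheckpointStore checkpoints; // e.g. a small local file

        RewindingConsumer(CheckpointStore checkpoints) { this.checkpoints = checkpoints; }

        // called after a batch of consumed messages has been flushed downstream
        void onBatchFlushed(long smallestUnflushedOffset) {
            checkpoints.save(smallestUnflushedOffset);
        }

        // called on restart after a crash: re-consume from the checkpointed offset
        long resumeOffset() {
            return checkpoints.load();
        }
    }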

We noticed that it was much easier to support consumer re-consumption in the pull model than in the push model.

3.2 Distributed coordination processing

Now let’s explain how producers and consumers perform in a distributed environment.

Each producer can publish a message either to a randomly selected partition or to a partition determined semantically by a partitioning key and a partitioning function. We will focus on how consumers interact with brokers.

Kafka has the concept of consumer groups.

Each consumer group consists of one or more consumers that jointly consume a set of subscribed topics; that is, each message is delivered to only one consumer within the group.

Different consumer groups consume the full set of subscriptions independently of each other, without the need for cross-consumer group coordination.

Consumers within the same group can be in different processes or on different machines.

Our goal is to distribute the messages stored in the brokers evenly among the consumers in a group without introducing too much coordination overhead.

Our first decision was to make a partition within a topic the smallest parallel unit.

This means that all messages for a partition are consumed by only one consumer in each consumer group at any one time.

Had we allowed multiple consumers to consume a single partition simultaneously, they would have to coordinate who consumes which messages, which requires locking and state maintenance and incurs extra overhead.

Instead, in our design, the consumption process only needs to be coordinated as consumers rebalance their load, which normally doesn’t happen very often.

However, to truly balance the load, we need many more partitions in a topic than there are consumers in each consumer group. We achieve this by over-partitioning topics.

The second decision we made was not to have a centralized “master” node, but to have consumers coordinate with each other in a decentralized way.

Adding a master node complicates the system because we have to worry more about master node failure.

To facilitate coordination, we used Zookeeper, a highly available consensus service.

Zookeeper has a very simple file system-like API.

One can create a path, set a path’s value, read a path’s value, delete a path’s value, and list a path’s subpaths.

It can also do something more interesting.

  • One can register a watcher on a path and be notified when the children of the path or the value of the path has changed
  • A path can be created as ephemeral (as opposed to persistent), which means that if the creating client goes away, the path is automatically removed by the Zookeeper server
  • Zookeeper replicates its data to multiple servers, which makes the data highly reliable and available.
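
A brief sketch of these operations using the standard Zookeeper Java client; the connection string, timeout, and paths below are made up for illustration and are not the paths Kafka uses.

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class ZookeeperBasics {
        static void demo() throws java.io.IOException, KeeperException, InterruptedException {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000,
                    (WatchedEvent event) -> System.out.println("change: " + event));

            // create a persistent path with a value (parent paths assumed to exist)
            zk.create("/demo/config", "value".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // create an ephemeral path: removed automatically when this client goes away
            zk.create("/demo/members/consumer-1", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // read a value and register a watcher on it (true = watch for changes)
            byte[] value = zk.getData("/demo/config", true, null);

            // list the children of a path, also registering a watcher
            List<String> children = zk.getChildren("/demo/members", true);

            // delete a path (-1 means any version)
            zk.delete("/demo/config", -1);
        }
    }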

Kafka uses Zookeeper to accomplish the following tasks.

  • Detect the addition and deletion of brokers and consumers

  • When the above event occurs, a rebalancing process is triggered in each consumer

  • Maintain consumption relationships and track consumption offsets for each partition.

Specifically, when each Broker or consumer is started, it stores its information in the Broker or consumer registry in Zookeeper.

The Broker registry contains the Broker’s host name and port, as well as the topics and partitions stored on it.

The consumer registry includes the consumer group to which the consumer belongs and the collection of topics to which it subscribes.

Each consumer group is associated with an ownership registry and an offset registry in Zookeeper.

The ownership registry has a path for each subscribed partition, and the path value is the consumer ID currently consumed from the partition (the term we use is that the consumer owns the partition).

The offset registry stores for each subscribed partition the offset of the last message consumed in that partition.

The paths created in Zookeeper for Broker, consumer, and ownership registries are all temporary.

Paths created in the offset registry are persistent.
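
A hypothetical sketch of these registrations with the Zookeeper client; the path layout is an assumption for illustration, not Kafka's actual layout. Broker, consumer, and ownership entries are ephemeral, while the consumed offset is written to a persistent path.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class RegistrySketch {
        static void register(ZooKeeper zk, String brokerId, String group, String consumerId,
                             String topic, int partition, long offset)
                throws KeeperException, InterruptedException {
            // ephemeral: removed automatically if the broker or consumer session ends
            zk.create("/brokers/" + brokerId, "host:port and stored topics".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            zk.create("/consumers/" + group + "/ids/" + consumerId, "subscribed topics".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            zk.create("/consumers/" + group + "/owners/" + topic + "/" + partition,
                    consumerId.getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // persistent: the last consumed offset survives consumer restarts
            // (this path is assumed to have been created already)
            zk.setData("/consumers/" + group + "/offsets/" + topic + "/" + partition,
                    Long.toString(offset).getBytes(), -1);
        }
    }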

If a Broker server fails, all partitions on it are automatically removed from the Broker registry.

A failure of a consumer results in the loss of its records in the consumer registry and all partition records in the ownership registry.

Each consumer registers a Zookeeper watcher on both the broker registry and the consumer registry, and is notified whenever a change occurs in the broker set or in its consumer group.


During a consumer's initial startup, or when the consumer is notified of a broker or consumer change through a watcher, the consumer initiates a rebalancing process to determine the new subset of partitions it should consume from.

This process is described in Algorithm 1.

By reading the broker and consumer registries from Zookeeper, the consumer first computes, for each subscribed topic T, the set of available partitions (PT) and the set of consumers subscribing to T (CT). It then range-partitions PT into |CT| chunks and deterministically picks one chunk based on its own position in CT.

For each partition that the consumer selects, it writes itself in the ownership registry as the new owner of that partition.

Finally, the consumer starts a thread to pull data from the owned partition, with offsets starting from recorded values stored in the offset registry.

As messages are pulled from a partition, the consumer periodically updates the offset registry with the latest consumed offset.
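
A sketch of the deterministic assignment at the heart of this rebalance, based on the paper's description of Algorithm 1; Zookeeper reads and writes and the fetcher threads are omitted, and the names are illustrative.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    class RebalanceSketch {
        // Sort partitions and consumers, split the partitions into |CT| ranges,
        // and let the i-th consumer take the i-th range. Every consumer computes
        // the same assignment because the inputs and ordering are the same.
        static List<String> partitionsFor(String myConsumerId,
                                          List<String> availablePartitions,   // PT
                                          List<String> groupConsumers) {      // CT
            List<String> partitions = new ArrayList<>(availablePartitions);
            List<String> consumers = new ArrayList<>(groupConsumers);
            Collections.sort(partitions);
            Collections.sort(consumers);

            int per = partitions.size() / consumers.size();
            int extra = partitions.size() % consumers.size(); // first 'extra' consumers get one more
            int i = consumers.indexOf(myConsumerId);

            int start = i * per + Math.min(i, extra);
            int count = per + (i < extra ? 1 : 0);
            return partitions.subList(start, start + count);
        }
    }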

When there are multiple consumers in a consumer group, each consumer is notified of Broker or consumer changes.

However, notifications arrive at slightly different times for each consumer.

Thus, it is possible that one consumer is trying to take ownership of a partition that is still owned by another consumer.

When this happens, the first consumer simply releases all partitions it currently owns, waits for a while, and then tries again to rebalance.

In practice, the rebalancing process usually stabilizes after just a few retries.

When creating a new consumer group, there are no offsets available in the offset registry.

In this case, the consumers will begin with either the smallest or the largest offset available on each subscribed partition (depending on configuration), using an API that we provide on the brokers.

3.3 Delivery Guarantee

In general, Kafka only guarantees at-least-once delivery.

Exactly-once delivery typically requires two-phase commit and is not necessary for our applications.

In most cases, a message is delivered exactly once to each consumer group.

However, when a consumer process crashes without shutting down cleanly, the consumer process that takes over its partitions may receive some duplicate messages, namely those after the last offset successfully committed to Zookeeper.

If an application cares about duplicates, it must add its own deduplication logic, either using the offsets that we return to the consumer or using some unique key within the messages. This is usually a more economical approach than using two-phase commit.
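
A minimal sketch of such application-level deduplication keyed on the consumed offset; this is illustrative only, not part of Kafka's API.

    import java.util.HashMap;
    import java.util.Map;

    class OffsetDeduplicator {
        private final Map<String, Long> highestProcessedOffset = new HashMap<>();

        // Treat anything at or below the highest offset already processed for a
        // partition as a duplicate caused by an unclean consumer failover.
        boolean isDuplicate(String topicPartition, long offset) {
            Long seen = highestProcessedOffset.get(topicPartition);
            if (seen != null && offset <= seen) {
                return true; // already processed before the takeover
            }
            highestProcessedOffset.put(topicPartition, offset);
            return false;
        }
    }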

Kafka ensures that messages from a single partition are delivered sequentially to consumers.

However, Kafka does not guarantee the order of messages from different partitions.

To avoid log corruption, Kafka stores a CRC for each message in the log.

If there are any I/O errors on a broker, Kafka runs a recovery process to remove messages with inconsistent CRCs.

Having CRC at the message level also allows us to check for network errors after a message is produced or consumed.
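
For illustration, such a per-message CRC check can be sketched with Java's built-in CRC32; Kafka's actual on-disk message format is not shown here.

    import java.util.zip.CRC32;

    class MessageCrc {
        static long checksum(byte[] payload) {
            CRC32 crc = new CRC32();
            crc.update(payload);
            return crc.getValue();
        }

        // A message whose stored CRC no longer matches its payload is treated as
        // corrupted and would be removed by the recovery process.
        static boolean isCorrupted(byte[] payload, long storedCrc) {
            return checksum(payload) != storedCrc;
        }
    }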

If a broker goes down, any unconsumed messages stored on it become unavailable.

If the storage system on a Broker is permanently corrupted, any unconsumed messages are lost forever.

In the future, we plan to add replication to Kafka to store each message redundantly across multiple brokers.

4. Kafka’s practice at LinkedIn

In this section, we’ll cover how we use Kafka on LinkedIn.

Figure 3 shows a simplified view of our deployment.

In each data center running user-facing services, we deploy a Kafka cluster.

The front-end service generates various log data and publishes it in batches to local Kafka brokers.

We rely on a hardware load balancer to distribute publish requests evenly across the Kafka brokers.

Kafka’s online consumers run within a service in the same data center.


We also deploy a separate Kafka cluster for offline analysis in each data center, located geographically close to our Hadoop cluster and other data warehouse infrastructure.

This Kafka instance runs a set of embedded consumers that pull data in real time from the Kafka instance in the data center.

We then run data load jobs to pull data from this replica Kafka cluster into Hadoop and our data warehouse, where we run various reporting jobs and analytical processing on the data.

We also use this Kafka cluster for prototyping, and for the ability to run simple scripts against the raw event streams for ad hoc querying.

Without much tuning, the end-to-end latency of the whole pipeline averages about 10 seconds, which is good enough for our requirements.

Currently, Kafka accumulates hundreds of gigabytes of data and nearly a billion messages per day.

We expect this number to grow significantly as we complete the migration of legacy systems.

More types of messages will be added in the future.

The rebalancing process automatically redirects consumption when an operator starts or stops a broker for software or hardware maintenance.

Our tracking also includes an auditing system to verify that no data is lost along the whole pipeline.

For convenience, each message has a timestamp and a server name.

We instrument each producer so that it periodically generates a monitoring event recording the number of messages it published for each topic within a fixed time window.

Producers publish monitoring events to a separate topic in Kafka.

Consumers can then count the number of messages they have received for a given topic and validate those counts against the monitoring events to verify the correctness of the data.
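
A small sketch of the consumer-side counting described above; the window size and the way counts are keyed are assumptions for illustration.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class AuditCounter {
        private final Map<String, Long> received = new ConcurrentHashMap<>();

        // Count received messages per topic and per 10-minute window.
        void onMessage(String topic, long timestampMs) {
            String windowKey = topic + "@" + (timestampMs / 600_000);
            received.merge(windowKey, 1L, Long::sum);
        }

        // Compare the local count with the count reported in the producer's
        // monitoring event for the same topic and window.
        boolean matchesProducerCount(String windowKey, long countReportedByProducer) {
            return received.getOrDefault(windowKey, 0L) == countReportedByProducer;
        }
    }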

Loading into Hadoop is accomplished by implementing a special Kafka input format that allows MapReduce jobs to read data directly from Kafka.

MapReduce jobs load raw data, then group and compress it for efficient processing in the future.

The stateless broker and client-side storage of message offsets come into play again here, allowing MapReduce task management (which allows tasks to fail and be restarted) to handle data loading in a natural way without duplicating or losing messages when a task restarts.

Data and offsets are stored in HDFS only after a task is successfully completed.

We chose to use Avro as our serialization protocol because it is efficient and supports schema evolution.

For each message, we store its Avro schema id and the serialized bytes in the payload.

This pattern allows us to enforce a contract to ensure compatibility between data producers and consumers.

We use a lightweight schema registry service to map schema ids to actual schemas.

When a consumer gets a message, it looks up the schema id in the schema registry to retrieve the schema, which is used to decode the bytes into an object (this lookup needs to be done only once per schema id, since the values are immutable).
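
A sketch of that consumer-side lookup with a per-id cache; the registry client interface below is hypothetical, not our actual service API.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class SchemaCache {
        interface RegistryClient {
            String fetchSchema(String schemaId);
        }

        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private final RegistryClient registry;

        SchemaCache(RegistryClient registry) { this.registry = registry; }

        // Only the first lookup for a given id hits the registry service;
        // schemas are immutable, so caching by id is safe.
        String schemaFor(String schemaId) {
            return cache.computeIfAbsent(schemaId, registry::fetchSchema);
        }
    }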

5. Experimental results

We conducted an experimental study comparing Kafka with Apache ActiveMQ v5.4, a popular open-source implementation of JMS, and RabbitMQ v2.4, a messaging system known for its performance.

We used ActiveMQ’s default persistent message store, KahaDB.

Although not covered here, we also tested another AMQ message store and found that its performance was very similar to KahaDB's.

Whenever possible, we tried to use comparable settings across all systems.

We ran the experiment on 2 Linux machines, each with 8 2GHz cores, 16GB of RAM, 6 disks, and RAID 10.

The two machines are connected by a 1Gb network link. One machine acts as a Broker and the other as a producer or consumer.

Producer test:

Brokers on all systems are configured to asynchronously flush messages to their persistent disks.

For each system, we ran a single producer to publish a total of 10 million messages, each 200 bytes in size.

We configure the Kafka producer to send messages in batches of 1 and 50.

ActiveMQ and RabbitMQ do not seem to have an easy way to batch messages, so we assume they used a batch size of 1. The results are shown in Figure 4.

The X-axis represents the amount of data sent to the Broker over time in MB, and the Y-axis corresponds to producer throughput in messages per second.

On average, Kafka can publish messages at a rate of 50,000 and 400,000 messages per second for batch sizes of 1 and 50, respectively.

These numbers are orders of magnitude higher than ActiveMQ, and at least twice as high as RabbitMQ.


Kafka performs much better for several reasons.

First, the Kafka producer currently does not wait for acknowledgements from the broker and sends messages as fast as the broker can handle.

This significantly increases the throughput of the publisher.

With a batch size of 50, a single Kafka producer almost saturates the 1Gb link between the producer and the broker.

This is an effective optimization in the case of log aggregation, because the data must be sent asynchronously to avoid introducing any delays in real-time service traffic.

Note also that without acknowledgements being sent back by the broker, there is no guarantee that every published message is actually received by the broker.

For many types of log data, it is desirable to trade durability for throughput, as long as the number of dropped messages is relatively small. However, we do plan to address durability for more critical data in the future.

Second, Kafka uses a more efficient storage format.

Normally, the overhead per message is 9 bytes in Kafka and 144 bytes in ActiveMQ.

This means that ActiveMQ uses 70% more space than Kafka to store the same 10 million messages.

One overhead of ActiveMQ comes from the heavy headers required by JMS.

Another cost is the cost of maintaining the various index structures.

We observed that one of the busiest threads in ActiveMQ spent most of its time accessing a B-tree to maintain message metadata and state.

Finally, batching greatly improves throughput by amortizing the RPC overhead. In Kafka, a batch size of 50 messages improves throughput by almost an order of magnitude.

Consumer testing:

In the second experiment, we tested consumer performance.

Again, for all systems, we used a single consumer to retrieve a total of 10 million messages.

We configured all systems to prefetch roughly the same amount of data per pull request — up to 1000 messages or about 200KB.

For ActiveMQ and RabbitMQ, we set the consumer confirmation mode to automatic.

Since all messages fit in memory, all systems served data from the page cache of the underlying file system or from in-memory buffers.

The result is shown in Figure 5.


Kafka consumes 22,000 messages per second on average, more than four times as many as ActiveMQ and RabbitMQ.

We can think of several reasons.

First, because Kafka has a more efficient storage format, the consumer transfers fewer bytes from the Broker.

Second, brokers in ActiveMQ and RabbitMQ must maintain the delivery state of each message.

We observed that one of the ActiveMQ threads was busy writing KahaDB pages to disk during this test.

In contrast, there was no disk write activity on the Kafka broker.

Finally, Kafka reduces the transport overhead by using the SendFile API.

To close this section, we note that the purpose of the experiments is not to show that other messaging systems are inferior to Kafka.

After all, ActiveMQ and RabbitMQ both have more features than Kafka.

The main point is to illustrate the performance gains that a customized system can bring.

6. Summary and future outlook

We came up with a new system called Kafka for processing massive log data streams.

Like a traditional messaging system, Kafka employs a pull-based consumption model that lets applications consume data at their own rate and rewind consumption whenever needed.

By focusing on log processing applications, Kafka achieves higher throughput than traditional messaging systems.

It also provides built-in distributed support and can scale out. We have been using Kafka successfully at LinkedIn for both offline and online applications.

Going forward, we have several directions.

First, we plan to add built-in replication of messages across multiple brokers, to provide durability and data availability guarantees even in the case of unrecoverable machine failures.

We want to support both asynchronous and synchronous replication models to allow some tradeoff between producer latency and the strength of guarantee provided.

An application can choose the appropriate level of redundancy based on its requirements for persistence, availability, and throughput.

Second, we want to add some streaming capabilities to Kafka.

Real-time applications that retrieve messages from Kafka often perform similar operations, such as window-based counting and joining each message with records in a secondary store or with messages in another stream.

At the lowest level, these operations are supported during publishing by semantically partitioning messages on their join key, so that all messages with a particular key go to the same partition and thus arrive at a single consumer process.

This provides a foundation for handling distributed flows in a consumer cluster.
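
A minimal sketch of such key-based semantic partitioning; the partitioner below is illustrative, not Kafka's built-in partitioning function.

    class KeyPartitioner {
        // Messages with the same join key hash to the same partition, so they are
        // all handled by the single consumer process that owns that partition.
        static int partitionFor(String joinKey, int numPartitions) {
            // mask the sign bit so the result is a valid partition index
            return (joinKey.hashCode() & 0x7fffffff) % numPartitions;
        }
    }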

On top of this, we feel that a library of stream-processing utilities, such as different windowing functions or join techniques, would be beneficial for this kind of application.

7. References

  1. http://activemq.apache.org/
  2. http://avro.apache.org/
  3. Cloudera's Flume, https://github.com/cloudera/flume
  4. http://developer.yahoo.com/blogs/hadoop/posts/2010/06/enabling_hadoop_batch_processi_1/
  5. Efficient data transfer through zero copy: https://www.ibm.com/developerworks/linux/library/j-zerocopy/
  6. Facebook's Scribe, http://www.facebook.com/note.php?note_id=32008268919
  7. IBM WebSphere MQ: http://www-01.ibm.com/software/integration/wmq/
  8. http://hadoop.apache.org/
  9. http://hadoop.apache.org/hdfs/
  10. http://hadoop.apache.org/zookeeper/
  11. http://www.slideshare.net/cloudera/hw09-hadoop-based-data-mining-platform-for-the-telecom-industry
  12. http://www.slideshare.net/prasadc/hive-percona-2009
  13. https://issues.apache.org/jira/browse/ZOOKEEPER-775
  14. Java Message Service: http://download.oracle.com/javaee/1.3/jms/tutorial/1_3_1-fcs/doc/jms_tutorialTOC.html
  15. Oracle Enterprise Messaging Service: http://www.oracle.com/technetwork/middleware/ias/index-093455.html
  16. http://www.rabbitmq.com/
  17. TIBCO Enterprise Message Service: http://www.tibco.com/products/soa/messaging/
  18. Kafka, http://sna-projects.com/kafka/
