Overview

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources into a centralized data store.

Flume is not limited to log data. Because data sources are customizable, Flume can be used to transport large amounts of event data of almost any kind, including network traffic data, social-media-generated data, email messages, and so on.

Apache Flume is a top-level project of the Apache Software Foundation; it was developed and maintained by Cloudera before joining Apache. Flume has two major version lines: 0.9.x and 1.x. The 0.9.x line is the historical version, known as Flume OG (Original Generation). On October 22, 2011, Cloudera completed FLUME-728, a milestone refactoring of the core components, core configuration, and code architecture; the result is known as Flume NG (Next Generation), i.e. the 1.x line.

This article introduces Flume's features and core concepts, giving a general picture of its usage scenarios, core components, and how each component works. How to configure Flume for different scenarios will be covered in detail in a separate article.

Architecture

Data flow model

A Flume event is defined as a unit of data flow. A Flume agent is a JVM process that hosts the components an event flows through. The three core components are the Source, the Channel, and the Sink.

A source consumes events delivered to it by an external source, such as a web server. The external source sends data to Flume in a format recognized by the target Flume source. For example, an Avro Flume source can receive Avro events from an Avro client (Avro is a binary data-serialization and RPC framework that originated as a Hadoop subproject), or from another Flume agent whose flow ends in an Avro sink. Similarly, a Thrift Flume source can receive events from a Thrift sink, from the Flume Thrift RPC client, or from any client written in any language that speaks the Flume Thrift protocol.
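For illustration, a minimal Avro source might be declared in an agent's properties file roughly as follows; the agent name a1, the bind address, and the port are assumptions for this sketch, not required values:

# Sketch: an Avro source that accepts events from Avro clients or upstream agents
a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414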

A channel can be thought of as a buffer that holds events from the source until a Flume sink consumes them. One example is the file channel, which stores events on the local file system (events can also be held in memory, of course).

The sink removes events from the channel as it consumes them and either writes them to an external storage system such as HDFS (using the Flume HDFS sink) or forwards them to the source of another Flume agent. The source and the sink operate on the channel asynchronously.

Complex flows

Flume lets users build complex data flows, such as flows in which events pass through multiple agents before landing. It also supports fan-in and fan-out flows, contextual routing, and backup routes (fail-over) for failed hops.

Reliability

Events are staged in each agent's channel and then delivered to the next agent or to the terminal store (such as HDFS) in the flow. An event is removed from the current channel only after it has been stored in the next agent's channel or in the terminal store. This mechanism makes the flow reliable end to end.

Flume uses a transactional approach to guarantee reliable delivery of events. Sources and sinks encapsulate the storage and retrieval of events in transactions provided by the channel, so putting events into a channel and taking them out are both transactional. This ensures that events are passed reliably from point to point in the flow.

Recoverability

Events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel (channel type "file") backed by the local file system. A memory channel (channel type "memory") is also supported, which simply holds events in an in-memory queue. The memory channel is faster than the file channel, but if the agent process dies, any events still sitting in a memory channel are lost and cannot be recovered.
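To make the difference concrete, here is a rough sketch of both channel types; the agent name a1, the channel name c1, and the directories are placeholders, not recommended values:

# Durable file channel: events survive an agent restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# In-memory alternative: faster, but events are lost if the agent process dies
# a1.channels.c1.type = memory
# a1.channels.c1.capacity = 1000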

Setup

Building an Agent

Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. One or more agent configurations can be specified in the same file. The configuration file describes the properties of each source, channel, and sink in an agent and how they are wired together to form a data flow.

Configuring a Single Component

Each component in the flow (source, channel, sink) has a name, a type, and its own set of configuration properties. For example, an Avro source needs a hostname (or IP address) and a port on which to receive data. A memory channel can have a maximum queue size ("capacity": the maximum number of events the channel can hold). An HDFS sink needs to know the file system URI (hdfs://****), the path where files land, and the file roll interval ("hdfs.rollInterval": how many seconds to wait before rolling the temporary file into a final file on HDFS). All of these properties must be specified in the configuration file.
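As a sketch of how the HDFS sink properties above might look in a configuration file (the agent name, namenode address, path, and interval are illustrative assumptions; the Avro source and channel properties follow the same pattern shown earlier):

# HDFS sink: file system URI plus landing path, and a 30-second roll interval
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k1.hdfs.rollInterval = 30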

Put the pieces together

The agent needs to know which components to load and how to wire them together to form a data flow. The configuration therefore lists the name of every source, channel, and sink and states explicitly which channel each source writes to and each sink reads from. For example, a source named "avroWeb" might send events to an HDFS sink through a channel named "file-channel". The configuration file must contain both the component names and these wiring relationships.
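A rough sketch of that wiring, assuming an agent named agent_foo and an HDFS sink named hdfs-sink (both names are illustrative):

# List the components owned by this agent
agent_foo.sources = avroWeb
agent_foo.channels = file-channel
agent_foo.sinks = hdfs-sink

# Wire them together: the source writes to the channel, the sink reads from it
agent_foo.sources.avroWeb.channels = file-channel
agent_foo.sinks.hdfs-sink.channel = file-channel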

Starting an Agent

You can start an agent using the flume-ng script in the Flume bin directory. On the command line you specify the agent's name, the configuration directory, and the configuration file:

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

When you run the above command, the agent runs the components described in the configuration file.

A simple example

Here is an example configuration file for a single-node Flume deployment.

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Looking at the configuration file, we can see that the agent is named a1. Its source listens on port 44444, its channel is a memory channel, and its sink (logger) writes events straight to the console. The file gives each component a name and describes its type and other properties. A single configuration file can, of course, define several agents; when we want to run a particular agent process, we give its name on the command line:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

Note that in a real deployment we would normally include one more option, --conf=<conf-dir>; that directory contains a shell script, flume-env.sh, and a log4j properties file. In this example, we also pass a Java system property (-Dflume.root.logger=INFO,console) to force Flume to log to the console.
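As a rough sketch (the heap sizes are placeholders, not recommendations), the flume-env.sh script in that conf directory can be used to set JVM options for the agent:

# conf/flume-env.sh -- sourced by the flume-ng script before the agent JVM starts
export JAVA_OPTS="-Xms512m -Xmx1g"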

Next, we can telnet to port 44444 and send data to the agent:

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK

The console of the agent process will print the data sent over telnet:

12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D          Hello world!. }

With this step completed, you have successfully configured and deployed a Flume Agent.

Data ingestion

Flume supports a number of mechanisms for obtaining data from external sources.

RPC

An Avro client can send a specified file to a source using RPC:

$ bin/flume-ng avro-client -H localhost -p 41414 -F /usr/logs/log.10

The above command sends the contents of /usr/logs/log.10 to the Flume source listening on port 41414.

Network Streams

Flume supports reading data from some popular log streams, such as:

  • Avro
  • Thrift
  • Syslog
  • Netcat
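As one example from this list, a syslog source receiving events over TCP could be declared roughly as follows; the agent name, port, and bind address are assumptions for illustration:

# Sketch: a syslog TCP source feeding channel c1
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.channels = c1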

Setting multi-agent flow

Consolidation

To collect logs from multiple hosts, you can deploy an agent on each host. The sinks of these agents all point to the source of a consolidation agent on a central host; that agent combines the incoming data and lands it on HDFS.
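A rough sketch of one hop in such a setup, assuming a web-server agent named weblog-agent forwarding to a consolidation agent named hdfs-agent on a host called collector01 (all names, hosts, and ports are illustrative, and channel definitions are omitted for brevity):

# On each web server: an Avro sink forwards events to the collector host
weblog-agent.sinks = avro-forward
weblog-agent.sinks.avro-forward.type = avro
weblog-agent.sinks.avro-forward.hostname = collector01
weblog-agent.sinks.avro-forward.port = 41414
weblog-agent.sinks.avro-forward.channel = file-ch

# On the collector: an Avro source receives events from all web servers
hdfs-agent.sources = avro-collect
hdfs-agent.sources.avro-collect.type = avro
hdfs-agent.sources.avro-collect.bind = 0.0.0.0
hdfs-agent.sources.avro-collect.port = 41414
hdfs-agent.sources.avro-collect.channels = file-ch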
