Background

The overall development process of a Hadoop business:



As the Hadoop business development flow chart shows, data collection is an essential and unavoidable step in any big data workflow.

Many companies' platforms generate large volumes of logs every day (typically streaming data, such as search engine page views and queries). Dedicated logging systems are required to process these logs. In general, such systems need the following characteristics:

(1) Build a bridge between the application system and the analysis system, decoupling the two from each other;

(2) Support both near-real-time online analysis systems and offline analysis systems such as Hadoop;

(3) Provide high scalability: as the data volume grows, the system can scale horizontally by adding nodes.

Open-source logging systems include Facebook's Scribe, Apache's Chukwa, LinkedIn's Kafka, and Cloudera's Flume, among others.

Flume, a real-time log collection system developed by Cloudera, is widely used in the industry. The original Flume distribution is currently collectively known as Flume OG (Original Generation) and belongs to Cloudera.

However, as Flume's functionality expanded, the shortcomings of Flume OG became apparent: bloated code, unreasonable core component design, and non-standard core configuration, especially in the last Flume OG release, 0.9.4. On October 22, 2011, Cloudera completed FLUME-728, a major overhaul of Flume that reworked the core components, core configuration, and code architecture. The reworked version is collectively known as Flume NG (Next Generation), and Cloudera Flume was renamed Apache Flume.

Flume is now an Apache top-level project: flume.apache.org/

Flume is a distributed, reliable, and highly available system for aggregating massive amounts of log data. You can customize various data senders to collect data, and Flume also provides the ability to do simple processing on the data and write it to various data receivers.

Flume has two major release lines, 0.9.x and 1.x; the 1.x line is renamed Flume NG, and the 0.9.x line is called Flume OG.

Flume currently provides only a Linux startup script; there is no Windows startup script.

3.1 Flume Features

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive amounts of log data. You can customize various data senders in the logging system to collect data, and Flume provides the ability to process the data and write it to various data receivers, such as text files, HDFS, and HBase.

Data flows through Flume as a sequence of events. An event is Flume's basic unit of data: it carries the log data (as a byte array) along with header information. Events are generated from data produced by sources outside the Agent. You can think of a Channel as a buffer that holds an event until a Sink has finished processing it; the Sink is then responsible for persisting the log or pushing the event on to another Source.

(1) Flume reliability. When a node fails, logs can be delivered to other nodes without loss. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (the Agent writes the event to disk when it receives the data, deletes it only after the data has been delivered successfully, and resends it if delivery fails), Store on failure (the strategy also used by Scribe: data is written locally when the receiver crashes and resent after it recovers), and Besteffort (data is sent to the receiver without any acknowledgment).

(2) Flume recoverability. Recoverability relies on the Channel. FileChannel is recommended: events are persisted on the local file system (at the cost of lower performance).

3.2 Flume Core Concepts

Client: produces the data; runs in an independent thread.

Event: a unit of data consisting of a message header and a message body (an event can be a log record, an Avro object, and so on).

Flow: an abstraction of the movement of events from a source to a destination.

Agent: an independent Flume process containing the Source, Channel, and Sink components. An Agent runs Flume inside a JVM; each machine runs one Agent, but a single Agent can contain multiple sources and sinks.

Source: the data-collection component; it gathers data from the Client and hands it to a Channel.

Channel: a temporary store for events handed over by the Source component; it connects sources and sinks and behaves somewhat like a queue.

Sink: reads and removes events from the Channel and passes them to the next Agent (if any) in the flow pipeline; it collects data from the Channel and runs in a separate thread.
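To make these concepts concrete, here is a minimal sketch of an Agent definition in Flume's properties-file format: a NetCat source feeding a memory channel that is drained by a logger sink. The names a1, r1, c1, and k1 are arbitrary examples, not fixed Flume identifiers.

# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for text lines on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events to the log (console)
a1.sinks.k1.type = logger

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1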

3.3 Flume NG Architecture

The core of Flume is the Agent. A Flume Agent is the smallest independent unit of execution; one Agent is one JVM. It is a complete data-collection tool containing three core components: Source, Channel, and Sink. Through these components, events flow from one place to another, as shown in the figure below.

3.4 Source

The Source is the data-collecting end. It captures data, applies any required formatting, encapsulates the data into events, and then pushes the events into one or more Channels.

Flume provides many Source implementations, including Avro Source, Exec Source, Spooling Directory Source, NetCat Source, Syslog Source, Syslog TCP Source, Syslog UDP Source, HTTP Source, and more. If the built-in Sources do not meet your needs, Flume also supports custom Sources.
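As one illustration, a Spooling Directory Source that ingests files dropped into a local directory could be configured roughly like this (the agent name a1 and the spoolDir path are assumptions; only the source portion is shown, wired to a channel c1 defined as in the earlier sketch):

a1.sources = r1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/flume-spool
a1.sources.r1.fileHeader = true
a1.sources.r1.channels = c1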

3.5 Channel

A Channel is the component that connects a Source and a Sink. It can be regarded as a data buffer (a data queue): it temporarily stores events in memory or persists them to local disk until a Sink has finished processing them.

For Channels, Flume provides Memory Channel, JDBC Channel, File Channel, and others.

MemoryChannel allows for high-speed throughput, but does not guarantee data integrity.

MemoryRecoverChannel has been superseded by FileChannel, as recommended in the official documentation.

FileChannel guarantees data integrity and consistency. When configuring a FileChannel, it is recommended to place the FileChannel's directory and the directory where the program's log files are kept on different disks to improve efficiency.
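A sketch of a FileChannel configured along those lines, with the checkpoint and data directories placed on different disks (the paths and the agent name a1 are illustrative assumptions):

a1.channels = c1
a1.channels.c1.type = file
# checkpoint and data directories on different disks
a1.channels.c1.checkpointDir = /data1/flume/checkpoint
a1.channels.c1.dataDirs = /data2/flume/data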

3.6 Sink

A Flume Sink retrieves data from a Channel and stores it in a file system or a database, or forwards it to a remote server.

Flume also provides various Sink implementations, including HDFS Sink, Logger Sink, Avro Sink, File Roll Sink, Null Sink, HBase Sink, and more.

When deciding where to store the data, a Flume Sink can write to a file system, a database, or Hadoop. When the volume of log data is small, the data can be kept in the file system with a fixed storage interval. When the volume of log data is large, the logs can be stored in Hadoop for later analysis.
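For example, an HDFS Sink that rolls files on a fixed time interval rather than by size or event count might look roughly like this (the HDFS path, agent name, and roll values are assumptions for illustration):

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop1:9000/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
# roll a new file every 600 seconds, never by size or event count
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1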



Flume Deployment Types

4.1 Single Process

4.2 Multi-Agent Process (Multiple Agents Are Connected in sequence)

Multiple Agents can be connected in sequence, so that data collected at the initial source ends up in the final storage system. This is the simplest case. In general, the number of Agents chained this way should be kept under control, because the data path becomes longer and, if failover is not considered, the failure of any one Agent affects the collection service of the entire flow.
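A sketch of two chained Agents (host names, port, log path, and agent names are illustrative assumptions): the first Agent forwards events over Avro, and the second receives them with an Avro Source.

# agent1, on the collecting machine: tail a log and forward it over Avro
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /var/log/app.log
agent1.sources.r1.channels = c1
agent1.channels.c1.type = memory
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = hadoop2
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1

# agent2, on hadoop2: receive from the previous hop and log to the console
agent2.sources = r1
agent2.channels = c1
agent2.sinks = k1
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545
agent2.sources.r1.channels = c1
agent2.channels.c1.type = memory
agent2.sinks.k1.type = logger
agent2.sinks.k1.channel = c1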

4.3 Flow Merging (Data from multiple Agents is converged on the same Agent)

For example, to collect user behavior logs from a website, the website runs as a load-balanced cluster for availability, and each node generates its own user behavior logs. You can configure an Agent on each node to collect its log data independently; these Agents then converge the data onto a single Agent, which writes it to a storage system such as HDFS.
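A rough sketch of that topology (the node names, port, log path, and HDFS path are assumptions): every web node runs the same forwarding Agent, and one collector Agent aggregates the events and writes them to HDFS.

# on each web node: collect the local access log and forward it over Avro
web.sources = r1
web.channels = c1
web.sinks = k1
web.sources.r1.type = exec
web.sources.r1.command = tail -F /var/log/nginx/access.log
web.sources.r1.channels = c1
web.channels.c1.type = memory
web.sinks.k1.type = avro
web.sinks.k1.hostname = collector-host
web.sinks.k1.port = 4545
web.sinks.k1.channel = c1

# on the collector node: one Avro source fans all web nodes into HDFS
collector.sources = r1
collector.channels = c1
collector.sinks = k1
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1
collector.channels.c1.type = file
collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = hdfs://hadoop1:9000/flume/weblogs/%Y-%m-%d
collector.sinks.k1.hdfs.useLocalTimeStamp = true
collector.sinks.k1.channel = c1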

4.4 Multiplexing Flow (Multilevel Flow)

Flume also supports multiplexing a flow across multiple levels. What does that mean? For example, when syslog, Java, Nginx, and Tomcat logs all flow into one Agent, the Agent can separate the mixed log streams and set up a dedicated transport channel for each kind of log.
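A hedged sketch of how that separation can be expressed with Flume's multiplexing channel selector, which routes each event to a channel based on one of its headers (the header name logtype and its values are assumptions; something upstream, such as an interceptor, is expected to set that header, and only the routing-related lines are shown):

a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = logtype
a1.sources.r1.selector.mapping.nginx = c1
a1.sources.r1.selector.mapping.tomcat = c2
a1.sources.r1.selector.default = c3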

4.5 Load Balancing

In the following figure, Agent1 is a routing node that balances the events temporarily stored in its Channel across multiple Sink components, each of which connects to an independent downstream Agent.
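A sketch of the corresponding sink group on Agent1 (agent and component names are illustrative; k1 and k2 would be Avro Sinks pointing at the two downstream Agents), using Flume's load_balance sink processor:

agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.backoff = true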

Flume Installation

5.1 Flume Download

mirrors.hust.edu.cn/apache/

Flume.apache.org/download.ht…

5.2 Flume Installation

The Flume framework depends on Hadoop and ZooKeeper only for their jar packages; the Hadoop and ZooKeeper services do not need to be running when Flume is started.

(1) Upload the installation package to the server and extract it:

[hadoop@hadoop1 ~]$ tar -zxvf apache-flume-1.8.0-bin.tar.gz -C apps/

(2) Create a soft link:

[hadoop@hadoop1 apps]$ ln -s apache-flume-1.8.0-bin/ flume

(3) Modify the configuration file:

[hadoop@hadoop1 ~]$ cd /home/hadoop/apps/apache-flume-1.8.0-bin/conf

[hadoop@hadoop1 conf]$ cp flume-env.sh.template flume-env.sh

(4) Configure environment variables:

[hadoop@hadoop1 conf]$ vi ~/.bashrc

#FLUME

export FLUME_HOME=/home/hadoop/apps/flume
export PATH=$PATH:$FLUME_HOME/bin

Save the file and make it take effect immediately:

[hadoop@hadoop1 conf]$ source ~/.bashrc

(5) View the version

[hadoop@hadoop1 ~]$ flume-ng version
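If the version prints correctly, a simple way to smoke-test the installation is to start an agent against a configuration file such as the NetCat sketch shown in section 3.2 (the file name example.conf and the agent name a1 are assumptions):

[hadoop@hadoop1 ~]$ flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console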