Big Data Architecture: Lambda

  • The Lambda architecture was proposed by Storm's author, Nathan Marz. Its goal is an architecture that satisfies the key requirements of real-time big data systems: high fault tolerance, low latency, and scalability.
  • The Lambda architecture combines offline and real-time computing and applies architectural principles such as immutability, read/write separation, and complexity isolation; it can integrate big data components such as Hadoop, Kafka, Storm, Spark, and HBase.

Three-tier architecture: the batch layer, the real-time (speed) layer, and the serving layer.
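To make the division of labor concrete, here is an illustrative Java sketch (not from the original article; all names are hypothetical) of how a serving layer answers queries by merging the batch layer's precomputed view with the speed layer's incremental view:

```java
import java.util.Map;

// Hypothetical serving layer: a query result is the batch view
// (complete but stale) plus the real-time view (recent events only).
public class ServingLayer {
    private final Map<String, Long> batchView;     // recomputed periodically by the batch layer
    private final Map<String, Long> realtimeView;  // updated continuously by the speed layer

    public ServingLayer(Map<String, Long> batchView, Map<String, Long> realtimeView) {
        this.batchView = batchView;
        this.realtimeView = realtimeView;
    }

    // e.g. total page views for a URL = precomputed count + recent delta
    public long pageViews(String url) {
        return batchView.getOrDefault(url, 0L) + realtimeView.getOrDefault(url, 0L);
    }
}
```

The batch view is rebuilt from the immutable master dataset, while the real-time view only covers events that arrived since the last batch run; merging the two gives a complete, low-latency answer.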

Flume and Kafka for data acquisition

Flume

Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transferring massive amounts of log data. Flume supports plugging custom data senders into the log system to collect data.

Flume can also lightly process data in flight and write it to a variety of (customizable) data receivers. It can collect data from sources such as the console, Thrift-RPC, text files, Unix tail, syslog (both TCP and UDP are supported), and exec (the output of a shell command).
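As a concrete illustration, the sketch below defines a minimal Flume agent that tails a log file into HDFS. The log path, channel sizing, and HDFS URL are assumptions, not values from the article:

```properties
# One source (tail a file via exec), one in-memory channel, one HDFS sink
agent.sources  = r1
agent.channels = c1
agent.sinks    = k1

# Source: follow an application log (path is a placeholder)
agent.sources.r1.type = exec
agent.sources.r1.command = tail -F /var/log/app.log
agent.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000

# Sink: write events to HDFS (namenode address is a placeholder)
agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
agent.sinks.k1.channel = c1
```

Started with `flume-ng agent --name agent --conf-file <this file>`, the agent streams new log lines into HDFS as they are appended.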

Kafka

Apache Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became an Apache project. Kafka is a fast, scalable, inherently distributed, partitioned, and replicated commit log service.

Apache Kafka differs from traditional messaging systems in the following ways:

  • It is designed as a distributed system, easy to scale out;
  • It provides high throughput for both publish and subscribe;
  • It supports multiple subscribers and automatically rebalances consumers when one fails;
  • It persists messages to disk, so it can serve batch consumption such as ETL as well as real-time applications; a minimal producer sketch follows this list.
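The following Java sketch publishes one message with the standard Kafka producer client. The broker address, topic name, and key are placeholders, not details from the article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Messages with the same key go to the same partition,
            // preserving per-key ordering for downstream consumers.
            producer.send(new ProducerRecord<>("app-logs", "host-1",
                                               "GET /index.html 200"));
        }
    }
}
```

Because messages are persisted to disk and partitioned, the same topic can feed both a batch ETL consumer and a real-time consumer independently.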

Workflow scheduler: Oozie

Tasks performed in Hadoop sometimes require multiple MapReduce jobs to be chained together. Oozie is a workflow scheduler for Hadoop: it lets you declare such multi-job pipelines, along with their dependencies, and runs them as a single workflow.
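For illustration, here is a minimal, hypothetical `workflow.xml` that chains two MapReduce actions; `${jobTracker}` and `${nameNode}` would come from the accompanying job properties file, and the per-job mapper/reducer configuration is omitted for brevity:

```xml
<!-- Two chained MapReduce actions; step2 runs only if step1 succeeds. -->
<workflow-app name="two-step-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="step1"/>
    <action name="step1">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- <configuration> naming mapper/reducer classes omitted -->
        </map-reduce>
        <ok to="step2"/>
        <error to="fail"/>
    </action>
    <action name="step2">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```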

Data analysis tool: Pig

Pig is a Hadoop-based platform for large-scale data analysis. It provides a SQL-like language called Pig Latin, whose compiler translates Pig Latin data-analysis requests into a series of optimized MapReduce jobs. Pig thus offers a simple operational and programming interface for complex parallel computation over massive data.
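A classic illustration is word count; the input path below is a placeholder. Each statement is compiled into (part of) a MapReduce job:

```pig
-- Word count in Pig Latin (input path is illustrative)
lines  = LOAD '/data/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;
```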

RDBMS and Hadoop data migration tool: Sqoop

Sqoop = SQL + Hadoop: it bulk-transfers data between relational databases and HDFS (or Hive/HBase) in both directions.
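For example, a typical import pulls a table from MySQL into HDFS with parallel map tasks; the host, database, table, credentials, and paths below are placeholders:

```bash
# Import the "orders" table into HDFS using 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password secret \
  --table orders \
  --target-dir /warehouse/orders \
  --num-mappers 4
```

A matching `sqoop export` moves computed results from HDFS back into the database.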

Data mining analysis tool: Mahout

Mahout is a powerful data mining tool: a collection of distributed machine learning algorithms, including a distributed collaborative filtering implementation called Taste, plus classification, clustering, and more. Mahout's biggest advantage is that it builds on Hadoop, turning many algorithms that used to run on a single machine into MapReduce jobs, which greatly increases both the volume of data the algorithms can handle and their performance.
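As a small taste of the Taste API, the Java sketch below builds a user-based recommender from a ratings file. The file name, neighborhood size, and user ID are illustrative assumptions:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv (placeholder): lines of userID,itemID,rating
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 recommendations for user 1
        List<RecommendedItem> recs = recommender.recommend(1, 3);
        for (RecommendedItem rec : recs) {
            System.out.println(rec.getItemID() + " : " + rec.getValue());
        }
    }
}
```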

Spark: a memory-based engine for large-scale, low-latency data analysis applications.
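To show the memory-based style, here is a minimal Spark job using the Java API; the master setting and input path are placeholders:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class QuickCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("QuickCount")
            .setMaster("local[*]");            // placeholder: run locally
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("/data/events.log");
            lines.cache();                     // keep the RDD in memory for reuse
            long errors = lines.filter(l -> l.contains("ERROR")).count();
            System.out.println("errors = " + errors);
        }
    }
}
```

Caching the RDD in memory is what makes repeated queries over the same data low-latency, in contrast to MapReduce jobs that reread from disk each time.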