Reference: Spark Programming Foundation (Scala Edition) by Lin Ziyu, Lai Yongxuan, and Tao Jiping, published by Posts and Telecommunications Press

1.HDFS distributed file system

The Hadoop Distributed File System (HDFS) is an open-source implementation of the Google File System (GFS). As one of the two core components of Hadoop, HDFS provides large-scale distributed file storage on clusters of inexpensive servers.

HDFS has good fault tolerance and is compatible with cheap hardware, so it can serve large volumes of read and write traffic on existing machines at low cost.

HDFS adopts a master/slave architecture. An HDFS cluster consists of one name node and several data nodes. The name node acts as the central server: it manages the file system namespace and client access to files.

Each data node in the cluster runs a data node process that handles read and write requests from file system clients and creates, deletes, and replicates data blocks under the unified scheduling of the name node.

Mechanism for clients to access files

When using HDFS, users can still use file names to store and access files just as in normal file systems.

In fact, within the system, a file is divided into several data blocks, which are distributed across several data nodes.

When a client needs to access a file, it first sends the file name to the name node. The name node looks up the data blocks that make up the file (a file may consist of multiple blocks), finds the data nodes where each block is actually stored, and returns these locations to the client. Finally, the client accesses those data nodes directly to read the data. The name node does not participate in data transfer at any point during the access.

This design allows the blocks of a file to be accessed concurrently on different data nodes, which greatly improves data access speed.
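
To make the read path concrete, here is a minimal Scala sketch using the Hadoop FileSystem client API. The name node address hdfs://localhost:9000 and the file path /user/hadoop/example.txt are assumptions for illustration; under the hood, open() asks the name node for block locations and the returned stream then reads directly from the data nodes.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object HdfsReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Assumes the name node listens at localhost:9000; adjust to your cluster.
    conf.set("fs.defaultFS", "hdfs://localhost:9000")
    val fs = FileSystem.get(conf)
    // open() contacts the name node for block locations; the data itself
    // is streamed straight from the data nodes that hold the blocks.
    val in = fs.open(new Path("/user/hadoop/example.txt"))
    try {
      Source.fromInputStream(in).getLines().foreach(println)
    } finally {
      in.close()
      fs.close()
    }
  }
}
```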

2.MapReduce

MapReduce is a distributed parallel programming model for parallel computing on large data sets (larger than 1 TB). It abstracts the complex parallel computing process running on a large cluster into two functions: Map and Reduce. MapReduce greatly simplifies distributed programming: even programmers with no experience in distributed parallel programming can easily run their programs on a distributed system to process massive data sets.

Parallel computing process

In MapReduce, a large data set stored in the distributed file system is sliced into small, independent blocks that many Map tasks can process in parallel. The MapReduce framework gives each Map task one slice of the data as input; the results produced by the Map tasks become the input of the Reduce tasks, and the Reduce tasks write the final results back to the distributed file system.
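
The following Scala sketch illustrates the Map and Reduce abstraction on plain in-memory collections. It shows the logical model only, not Hadoop's actual Java MapReduce API, and the sample input lines are made up.

```scala
// Word count expressed with the two MapReduce primitives (logical model only).
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("hello world", "hello mapreduce")

    // Map phase: each input line is turned into (word, 1) pairs.
    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.split("\\s+").map(word => (word, 1)))

    // Shuffle: pairs with the same key (word) are grouped together.
    val grouped: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }

    // Reduce phase: the counts for each word are summed.
    val counts: Map[String, Int] =
      grouped.map { case (w, ones) => (w, ones.sum) }

    counts.foreach { case (word, count) => println(s"$word\t$count") }
  }
}
```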

The idea of moving computation close to the data

MapReduce is designed to “move computation to the data” rather than “move data to the computation”, because moving computation is more economical than moving data: transferring data over the network incurs heavy overhead, especially in big data environments. With this in mind, whenever possible the MapReduce framework runs a Map task on or near the node where its HDFS data resides, that is, compute nodes and storage nodes are co-located, which reduces the cost of moving data between nodes.

3.YARN

YARN is the Hadoop component responsible for cluster resource scheduling and management. Its goal is to support multiple frameworks on one cluster: a single unified resource scheduling and management framework, YARN, is deployed on the cluster, and computing frameworks such as MapReduce, Tez, Storm, Giraph, Spark, and OpenMPI are deployed on top of YARN. YARN provides unified resource scheduling and management services (covering CPU and memory resources) for these computing frameworks and adjusts allocations according to each framework's load, achieving resource sharing and elastic scaling of resources.

In this way, different application workloads can be mixed on one cluster, which effectively improves cluster utilization. At the same time, different computing frameworks can share the underlying storage: multiple data sets can be consolidated in one cluster and accessed by multiple computing frameworks, avoiding moving data sets across clusters. Finally, this deployment model also greatly reduces enterprise operation and maintenance costs.
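
As a concrete example of the resources YARN manages, a minimal yarn-site.xml might declare how much memory and CPU each node offers to containers. The values below are illustrative assumptions to be tuned per machine.

```xml
<!-- yarn-site.xml: illustrative resource settings; tune values to your hardware. -->
<configuration>
  <!-- Memory (MB) this node's NodeManager may hand out to containers. -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
  <!-- CPU cores this node's NodeManager may hand out to containers. -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
</configuration>
```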

Computing frameworks supported by YARN

Currently, the computing frameworks that can run on YARN include the offline batch processing framework MapReduce, the in-memory computing framework Spark, the stream computing framework Storm, and the DAG computing framework Tez. Other resource management and scheduling frameworks that provide functions similar to YARN's include Mesos, Torca, Corona, and Borg.
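
For instance, a Spark application is typically handed to a YARN cluster with spark-submit. In this sketch the application class, jar name, and resource figures are placeholders.

```bash
# Submit a Spark application to YARN (class, jar, and sizes are placeholders).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --executor-memory 2g \
  --num-executors 4 \
  myapp.jar
```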

4.HBase

HBase is an open-source implementation of Google BigTable. It is a highly reliable, high-performance, column-oriented, and scalable distributed database, used mainly to store unstructured and semi-structured data.

HBase supports very large-scale data storage: it can scale horizontally on clusters of inexpensive computers to handle tables with more than a billion rows and millions of columns.

HBase uses MapReduce to process the massive data it stores and achieve high-performance computing, ZooKeeper as a coordination service for stable operation and failure recovery, and the Hadoop Distributed File System (HDFS) as its highly reliable underlying storage, drawing on cheap clusters for massive storage capacity. HBase can also be deployed on a single machine, with the local file system in place of HDFS as the underlying storage; however, to improve data reliability and system robustness, HDFS is generally used. In addition, Sqoop provides efficient and convenient RDBMS data import for HBase to ease data processing on it, while Pig and Hive provide high-level language support.
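
The sketch below shows a basic write and read through the HBase client API in Scala. It assumes an existing table named user with a column family info, and that an hbase-site.xml on the classpath points at your ZooKeeper quorum.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseExample {
  def main(args: Array[String]): Unit = {
    // Reads the ZooKeeper quorum from hbase-site.xml on the classpath.
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    try {
      // Assumes a table "user" with column family "info" already exists.
      val table = connection.getTable(TableName.valueOf("user"))

      // Write one cell: row key "row1", column info:name.
      val put = new Put(Bytes.toBytes("row1"))
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"))
      table.put(put)

      // Read the cell back.
      val result = table.get(new Get(Bytes.toBytes("row1")))
      val name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")))
      println(s"info:name = $name")

      table.close()
    } finally {
      connection.close()
    }
  }
}
```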

5.Hive

Hive is a Hadoop-based data warehouse tool that can be used to organize, query, and analyze data sets stored in Hadoop files.

Hive has a low learning threshold because it provides HiveQL, a query language similar to the SQL used in relational databases. Simple MapReduce statistics can be implemented quickly with HiveQL statements: Hive automatically converts them into MapReduce jobs and runs them, so there is no need to develop a dedicated MapReduce application. This makes Hive well suited to statistical analysis of data warehouses.
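
As an illustration, a client can submit HiveQL through the HiveServer2 JDBC driver. This Scala sketch assumes HiveServer2 listening on localhost:10000 (the default port) and a hypothetical page_views table; Hive compiles the query into MapReduce jobs behind the scenes.

```scala
import java.sql.DriverManager

object HiveQueryExample {
  def main(args: Array[String]): Unit = {
    // Assumes HiveServer2 at localhost:10000; user/password depend on your setup.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
    try {
      val stmt = conn.createStatement()
      // A HiveQL aggregation over a hypothetical table; Hive turns it into MapReduce jobs.
      val rs = stmt.executeQuery(
        "SELECT country, COUNT(*) AS cnt FROM page_views GROUP BY country")
      while (rs.next()) {
        println(s"${rs.getString("country")}\t${rs.getLong("cnt")}")
      }
    } finally {
      conn.close()
    }
  }
}
```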

6.Flume

Flume is Cloudera’s highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data.

Flume supports customizing all kinds of data senders in a logging system to collect data; it also provides the ability to perform simple processing on the data and write it to various data receivers.
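
A minimal Flume agent configuration might tail a local log file and deliver the events to HDFS. The agent name, file path, and HDFS URL below are illustrative assumptions.

```properties
# Illustrative Flume agent "a1": tail a log file and write events to HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow new lines appended to a local log file (path is a placeholder).
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: deliver events to HDFS (URL is a placeholder).
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/logs
a1.sinks.k1.channel = c1
```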

7.Sqoop

Sqoop, short for SQL-to-Hadoop, is primarily used to exchange data between Hadoop and relational databases, improving data interoperability.

Sqoop allows you to easily import data from relational databases such as MySQL, Oracle, and PostgreSQL into Hadoop (HDFS, HBase, or Hive) and to export data from Hadoop to relational databases, which makes data migration between traditional relational databases and Hadoop very convenient.
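
For example, typical Sqoop import and export invocations look like the following sketch; the connection string, credentials, table names, and directories are placeholders.

```bash
# Import a MySQL table into HDFS (connection details and names are placeholders).
sqoop import \
  --connect jdbc:mysql://localhost:3306/mydb \
  --username dbuser \
  --password dbpass \
  --table customers \
  --target-dir /user/hadoop/customers \
  --num-mappers 4

# Export HDFS data back into a relational table.
sqoop export \
  --connect jdbc:mysql://localhost:3306/mydb \
  --username dbuser \
  --password dbpass \
  --table customers_out \
  --export-dir /user/hadoop/customers
```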