The Hadoop system runs on a computing cluster of commodity servers, which provides large-scale distributed data storage resources as well as large-scale parallel computing resources.

On the software side, as the open source Apache Hadoop system has developed, the Hadoop platform has evolved from its basic subsystems, such as HDFS, MapReduce, and HBase, into a complete big data processing ecosystem containing many related subsystems. Figure 1-15 shows the basic components and ecosystem of the Hadoop platform.

  


1. MapReduce parallel computing framework

The MapReduce parallel computing framework is a parallel program execution system. It provides a two-stage Map and Reduce parallel processing model, together with a parallel programming model and interfaces, so that programmers can write parallel big data processing programs easily and quickly. MapReduce takes its input as key-value pairs and automatically partitions and manages the data. During program execution, the MapReduce framework is responsible for scheduling and allocating computing resources, partitioning input and output data, scheduling program execution, monitoring execution status, synchronizing the computing nodes, and collecting intermediate results. The MapReduce framework provides a complete set of programming interfaces for programmers to develop MapReduce applications.
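
As an illustration of the two-stage Map/Reduce model and its Java programming interface, the following is a minimal word-count sketch in the standard Hadoop MapReduce style; the class names and input/output paths are placeholders rather than anything prescribed by the text.

```java
// Minimal WordCount sketch using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: emit (word, 1) for every word in an input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: sum the counts collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```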

2. Distributed file system HDFS

The Hadoop Distributed File System (HDFS) is an open source distributed file system similar to Google's GFS. It provides a scalable, highly reliable, and highly available large-scale distributed data storage and management system. Built on top of the local Linux file systems that are physically distributed across the data storage nodes, it provides upper-layer applications with a logically integrated large-scale file system for data storage. Like GFS, HDFS uses a multi-replica (three replicas by default) redundant storage mechanism and provides effective data error detection and data recovery mechanisms, greatly improving the reliability of data storage.
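
The HDFS Java client presents this logically integrated file system through the FileSystem abstraction. Below is a small sketch that writes and reads one file; the NameNode address and file paths are placeholders.

```java
// Writing and reading a file through the HDFS FileSystem API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            // Write a small file; HDFS replicates its blocks (3 copies by default).
            Path file = new Path("/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back through the same logical file-system view.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```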

3. HBase distributed database management system

To address HDFS's limitations in managing massive structured and semi-structured data, Hadoop provides HBase, a large-scale distributed database management and query system. HBase is a distributed database built on the Hadoop Distributed File System (HDFS). It is a distributed, scalable NoSQL database that provides real-time read/write and random access to structured, semi-structured, and even unstructured big data. HBase provides a three-dimensional data management model based on rows, columns, and timestamps. Each table in HBase can have billions or more records (rows), and each record can have millions of fields.
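
The row/column-family/timestamp model is exposed through the HBase Java client roughly as sketched below (HBase 1.x/2.x-style API); the table, column family, and row key names are invented for illustration.

```java
// One write and one random read through the HBase Java client.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_actions"))) {

            // Write one cell: row key + column family + qualifier (timestamp assigned automatically).
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("last_login"), Bytes.toBytes("2024-01-01"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("user#1001")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("last_login"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```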

4. Common service module

Common is a set of class libraries and API programming interfaces that provide underlying support services and common tools for the entire Hadoop system, such as the abstract FileSystem, remote procedure call (RPC), the system configuration tool Configuration, and the serialization mechanism. In versions 0.20 and earlier, Common also contained HDFS, MapReduce, and other core project content; starting from version 0.21, HDFS and MapReduce were separated into independent sub-projects, and the remaining parts form Hadoop Common.
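
As a small illustration of two of these facilities, the sketch below uses the Configuration tool and implements the Writable serialization interface; the property value and the PairWritable class are illustrative examples, not part of Hadoop itself.

```java
// Using Hadoop Common's Configuration tool and Writable serialization interface.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;

public class CommonExample {

    // A custom value type made serializable for HDFS/MapReduce via Writable.
    public static class PairWritable implements Writable {
        private long count;
        private double sum;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(count);
            out.writeDouble(sum);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            count = in.readLong();
            sum = in.readDouble();
        }
    }

    public static void main(String[] args) {
        // Configuration loads core-site.xml and related files from the classpath
        // and lets applications read and override properties programmatically.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder address
        System.out.println(conf.get("fs.defaultFS"));
    }
}
```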

5. Avro data serialization system

Avro is a data serialization system for converting data structures or data objects into a format suitable for data storage and network transmission. Avro offers rich data structure types, a fast and compressible binary data format, a container file format for storing persistent data, remote procedure call (RPC), and simple integration with dynamic languages.
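
A minimal sketch of Avro's generic Java API is shown below: a schema is defined in JSON, a record is filled in, and the record is written to an Avro container file. The "User" schema and the file name are invented for illustration.

```java
// Serializing one record into an Avro container file with the generic API.
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Schemas are defined in JSON; records are serialized in a compact binary form.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Write the record to an Avro container file (persistent, splittable, compressible).
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}
```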

6. Distributed coordination service framework ZooKeeper

ZooKeeper is a distributed coordination service framework used to solve consistency problems in distributed environments. ZooKeeper provides functions such as system reliability maintenance, data state synchronization, unified naming service, and configuration management for distributed applications. In a distributed environment, ZooKeeper can be used to maintain small but important pieces of state data for system operation and management, and it provides a mechanism for monitoring changes to that state, so that it can cooperate with other Hadoop subsystems (such as HBase and Hama) or user-developed applications to solve problems such as system reliability management and state maintenance in distributed environments.
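
The sketch below shows the basic ZooKeeper Java client pattern implied here: store a small piece of shared state in a znode, read it back, and register a watch that fires when it changes. The connection string and znode path are placeholders.

```java
// Creating, reading, and watching a znode with the ZooKeeper Java client.
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();   // session established
            }
        });
        connected.await();

        // Store a small piece of shared state under a znode.
        String path = "/app-master";
        if (zk.exists(path, false) == null) {
            zk.create(path, "node-1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back and register a watch that fires when the data changes.
        byte[] data = zk.getData(path, event -> System.out.println("state changed: " + event), null);
        System.out.println(new String(data));

        zk.close();
    }
}
```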

7. Hive, a distributed data warehouse processing tool

Hive is a data warehouse built on Hadoop for managing structured and semi-structured data stored in HDFS or HBase. It was first developed by Facebook to process and analyze large amounts of user and log data, and in 2008 Facebook contributed it to Apache as an open source Hadoop project. To make it easy for users of traditional SQL databases to query and analyze data on Hadoop, Hive offers HiveQL, an SQL-like query language, as its programming interface, and provides the data extraction and transformation, storage management, and query and analysis functions required by a data warehouse. Under the hood, HiveQL statements are converted into corresponding MapReduce programs for execution.
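
One common way to issue HiveQL from a program is through the HiveServer2 JDBC driver, as in the hedged sketch below; the connection URL, table, and query are placeholders.

```java
// Running HiveQL through the HiveServer2 JDBC interface.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // DDL: define a warehouse table over delimited files stored in HDFS.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views ("
                       + "user_id STRING, url STRING, view_time BIGINT) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

            // HiveQL query; Hive compiles it into MapReduce jobs under the hood.
            try (ResultSet rs = stmt.executeQuery(
                     "SELECT url, COUNT(*) AS views FROM page_views GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + "\t" + rs.getLong("views"));
                }
            }
        }
    }
}
```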

8. Data stream processing tool Pig

Pig is a platform for processing large datasets, developed by Yahoo! and contributed to Apache as an open source project. It simplifies data analysis on Hadoop by providing Pig Latin, a domain-oriented high-level abstraction language that lets programmers express complex data analysis tasks as data-flow scripts of Pig operations; when executed, these scripts are automatically converted into chains of MapReduce tasks and run on Hadoop. Yahoo! runs a large number of its MapReduce jobs through Pig.
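
The sketch below embeds a small Pig Latin data-flow script in Java through the PigServer class; the input path, field layout, and output path are made up, and constructor details vary somewhat across Pig versions.

```java
// Building and storing a Pig Latin data flow via the PigServer embedding API.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);   // run on the Hadoop cluster

        // Each registerQuery call adds one Pig Latin statement to the data-flow script.
        pig.registerQuery("logs = LOAD '/data/access_log' USING PigStorage('\\t') "
                        + "AS (user:chararray, url:chararray, bytes:long);");
        pig.registerQuery("by_url = GROUP logs BY url;");
        pig.registerQuery("traffic = FOREACH by_url GENERATE group AS url, SUM(logs.bytes) AS total;");

        // STORE triggers compilation of the script into a chain of MapReduce jobs.
        pig.store("traffic", "/output/traffic_by_url");
        pig.shutdown();
    }
}
```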

9. Cassandra, a key-value pair database system

Cassandra is a distributed key-value database system originally developed by Facebook to store simple formatted data such as inbox messages; Facebook later contributed it to Apache as an open source project. Cassandra combines the fully distributed architecture of Amazon's Dynamo with the column-family data model of Google's BigTable, providing a highly scalable, eventually consistent, distributed structured key-value storage system that better meets the needs of massive data storage. Cassandra scales horizontally rather than vertically and provides richer functionality than typical key-value data stores.
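
The sketch below illustrates the row-key plus column-family style of access using CQL and the DataStax Java driver, which is a more recent client than the interface implied by the description above; the contact point, keyspace, and table are placeholders.

```java
// Row-key + column access in Cassandra via CQL and the DataStax Java driver.
import java.net.InetSocketAddress;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .build()) {

            // A table plays the role of a column family: rows keyed by user_id,
            // each row holding a set of named columns.
            session.execute("CREATE KEYSPACE IF NOT EXISTS mail "
                          + "WITH replication = {'class':'SimpleStrategy','replication_factor':3}");
            session.execute("CREATE TABLE IF NOT EXISTS mail.inbox "
                          + "(user_id text, msg_id text, subject text, PRIMARY KEY (user_id, msg_id))");

            session.execute("INSERT INTO mail.inbox (user_id, msg_id, subject) "
                          + "VALUES ('u42', 'm1', 'hello')");

            ResultSet rs = session.execute("SELECT msg_id, subject FROM mail.inbox WHERE user_id = 'u42'");
            for (Row row : rs) {
                System.out.println(row.getString("msg_id") + ": " + row.getString("subject"));
            }
        }
    }
}
```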

10. Log data processing system Chukwa

Chukwa is an open source data collection system contributed by Yahoo!. It is mainly used for log collection and data monitoring, and works with MapReduce to process the collected data. Chukwa is a large-scale cluster monitoring system built on Hadoop: it inherits the reliability of the Hadoop system, has good adaptability and scalability, uses HDFS to store data and MapReduce to process it, and also provides flexible and powerful auxiliary tools to analyze, display, and monitor the resulting data.

11. Scientific computing basic tool library Hama

Hama is a computing framework based on the Bulk Synchronous Parallel (BSP) model. It mainly provides a set of supporting frameworks and tools for large-scale scientific computing and graph computing with complex data dependencies. Hama is similar to Google's Pregel, which Google uses for computations such as graph traversal (BFS), shortest paths (SSSP), and PageRank. Hama integrates well with Hadoop's HDFS, using HDFS for persistent storage of the tasks and data to be run. Because of the flexibility of the BSP parallel computing model, the Hama framework is well suited to large-scale scientific computing and graph computing, and can carry out matrix computation, sorting, PageRank, BFS, and other big data computing and processing tasks.
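
To make the BSP model concrete, the following is a schematic superstep loop (local computation, message exchange, barrier synchronization) written in plain Java; it deliberately does not use the actual Hama API and is only meant to illustrate the execution model.

```java
// Schematic BSP supersteps: compute locally, send messages, synchronize at a barrier.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BspSketch {
    static final int PEERS = 4;
    static final int SUPERSTEPS = 3;

    public static void main(String[] args) {
        // Per-peer mailboxes; outboxes are delivered to inboxes at each barrier.
        List<List<String>> inbox = new ArrayList<>();
        List<List<String>> outbox = new ArrayList<>();
        for (int i = 0; i < PEERS; i++) { inbox.add(new ArrayList<>()); outbox.add(new ArrayList<>()); }

        // Barrier action: deliver all pending messages, then clear the outboxes.
        CyclicBarrier barrier = new CyclicBarrier(PEERS, () -> {
            for (int i = 0; i < PEERS; i++) {
                inbox.get(i).clear();
                inbox.get(i).addAll(outbox.get(i));
                outbox.get(i).clear();
            }
        });

        for (int p = 0; p < PEERS; p++) {
            final int peer = p;
            new Thread(() -> {
                try {
                    for (int step = 0; step < SUPERSTEPS; step++) {
                        // 1) Local computation using messages received in the previous superstep.
                        int received = inbox.get(peer).size();
                        // 2) Send a message to a neighbour (each outbox has a single writer here).
                        outbox.get((peer + 1) % PEERS).add("step" + step + " from peer" + peer);
                        // 3) Barrier synchronization ends the superstep.
                        barrier.await();
                        System.out.println("peer " + peer + " finished superstep " + step
                                         + " (had " + received + " messages)");
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}
```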

12. Data analysis and mining tool library Mahout

Mahout originated as a subproject of Apache Lucene; its main goal is to create and provide libraries of classic parallel machine learning and data mining algorithms so that programmers who need these algorithms for data analysis and mining do not have to implement them themselves. Mahout now includes a wide range of machine learning and data mining algorithms for clustering, classification, recommendation engines, frequent itemset mining, and more. In addition, it provides tools and frameworks for data input and output and for data integration with other data storage and management systems.

13. Relational data exchange tool Sqoop

Sqoop, short for SQL-to-Hadoop, is a tool for fast bulk data exchange between relational databases and the Hadoop platform. It can import data in batches from a relational database into Hadoop's HDFS, HBase, and Hive, or, conversely, export data from the Hadoop platform into a relational database. Sqoop makes full use of the parallelism of Hadoop MapReduce: the entire data exchange process is parallelized on top of MapReduce for speed.
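
Sqoop is normally driven from the command line; the sketch below shows one way an import might be launched programmatically, assuming the Sqoop 1.x Java entry point. The JDBC URL, credentials, table, and target directory are all placeholders.

```java
// A hedged sketch of a Sqoop 1.x import launched from Java (entry point assumed).
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/sales",   // source relational database
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",                             // table to import
            "--target-dir", "/warehouse/orders",             // destination directory in HDFS
            "--num-mappers", "4"                             // degree of MapReduce parallelism
        };
        int exitCode = Sqoop.runTool(importArgs);            // copies the data with parallel map tasks
        System.exit(exitCode);
    }
}
```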

14. Log data collection tool Flume

Developed and maintained by Cloudera, Flume is a distributed, highly reliable, and highly available system for collecting large-scale log data in complex environments. It abstracts the production, transmission, processing, and output of data as a data flow, and allows data senders to be defined at the data source, so that data can be collected over a variety of transport protocols; it also provides simple processing of log data such as filtering and format conversion. Flume supports writing log data to customizable output targets.