Preface

Speaking of Hadoop, we know that it is an open-source distributed computing platform that runs on large-scale clusters and provides the MapReduce computing model, the HDFS distributed file system, and other functions. However, many of us do not know enough about the wider Hadoop ecosystem, so, in the spirit of learning, let's explore it together.

About Hadoop

When we come across big data, or anything related to it, in our daily work, we think of Hadoop, but when it comes to specifics we often do not know where to start. Hadoop is written in Java, so it is highly portable across platforms and can be deployed on clusters of inexpensive machines, which has contributed to its popularity. The core of Hadoop is HDFS and MapReduce. HDFS is an open-source implementation of the Google File System (GFS) with high read/write throughput, fault tolerance, and scalability. With MapReduce on top of it, applications can be developed without understanding the low-level details of a distributed system, so storing and processing massive amounts of data becomes straightforward.

The characteristics of Hadoop

  • High reliability: data is stored redundantly, so even if one copy fails, other copies keep the service available.
  • High efficiency: distributed storage and distributed processing, the two cores, make it possible to process very large amounts of data efficiently.
  • High scalability: as mentioned earlier, Hadoop is deployed on clusters of inexpensive machines, so it is easy to scale out.
  • Low cost: follows from the high scalability; commodity hardware keeps costs down.
  • High fault tolerance: follows from the high reliability; failed tasks are automatically reassigned.
  • Cross-platform: Hadoop runs on Linux, and because it is written in Java it keeps Java's cross-platform advantages.
  • Support for programming in multiple languages.

Hadoop ecosystem

In addition to HDFS and MapReduce, Hadoop includes many other functional components. Common ones include ZooKeeper, HBase, Hive, Pig, Mahout, Sqoop, Flume, and Ambari.

Here is a brief overview of each component, with the most common ones described in more detail.

  • Ambari: a web-based tool for creating, managing, and monitoring clusters in the Hadoop ecosystem; it is designed to make Hadoop and related big data software easier to use.

  • ZooKeeper: ZooKeeper provides consistent coordination services for distributed systems, such as configuration management, naming services, and distributed synchronization. It is also worth discussing why ZooKeeper is not well suited as a discovery service, and how Kafka, as messaging middleware, uses ZooKeeper to maintain cluster information. Let's look at ZooKeeper's features (a small client sketch follows this component list):

    1. Eventual consistency: most importantly, a client sees the same view no matter which server it connects to.
    2. Reliability: ZooKeeper is simple, robust, and performs well; if a message is accepted by one server, it is accepted by all servers.
    3. Timeliness: clients are guaranteed to see server updates or invalidations within a bounded interval. Because of network delay, not all clients see an update at exactly the same moment; a client that needs the latest data should call the sync interface before reading.
    4. Wait-free: disconnected, slow, or failed clients cannot interfere with requests from fast clients.
    5. Atomicity: an update either succeeds or fails; there is no intermediate state.
    6. Ordering: transaction requests from the same client are applied to ZooKeeper strictly in the order they were sent. Ordering includes global order and partial order. Global order is easy to understand: all servers publish messages in the same order. Partial order means that if message B is published by the same publisher after message A, then A is ordered before B. The core of ZooKeeper is its atomic broadcast protocol, which keeps the servers synchronized; when the leader crashes, the servers hold an election.

      Typical application scenarios: configuration management, cluster management, distributed locks, leader election, queue management, and so on.

      A typical cluster mode is master/slave: the master handles writes and the slaves handle reads. ZooKeeper, however, does not implement this model; it adopts three roles instead.

      ZooKeeper's three roles are:
    • Leader: responsible for initiating and resolving votes and for updating the system state.
    • Learners: divided into followers and observers. Followers receive client requests, return results to clients, and take part in voting. Observers receive client write requests and forward them to the leader, but do not vote; they only synchronize the leader's state.
    • Client: the initiator of requests.
  • HBase: a distributed, column-oriented database with real-time reads and writes, used to make up for Hadoop's weakness in real-time access. An important difference from traditional relational databases is that relational databases are row-oriented while HBase is column-oriented. It can be viewed as a key-value store, or as a multi-versioned map. Four concepts describe it (a small client sketch follows this component list):

    • Row key (rowkey): can be understood as a unique identifier for a row. HBase does not allow cross-row transactions, so the design of row keys and column families is important and tricky.
    • Column family: a table must have at least one column family, and a column family consists of multiple columns. Column families are a physical property that affects how data is stored.
    • Column qualifier: can be viewed as a column. Data in a column family is located by column qualifier. Qualifiers do not need to be defined in advance, which differs from relational databases and allows arbitrary extension.
    • Timestamp: a row key, column family, and column qualifier together identify a cell, and a cell stores multiple versions identified by timestamps. By default, three versions are kept.
  • Hive: Hive provides HiveQL, a query language similar to SQL. HiveQL statements can quickly implement simple MapReduce statistics, which makes Hive well suited for statistical analysis in data warehouses.

  • Pig: provides a simpler procedural-language abstraction on top of MapReduce, with an interface closer to Structured Query Language (SQL). Pig's greatest value is its scripting language, Pig Latin, which lets a few simple scripts express what would otherwise be hand-written MapReduce algorithms.

  • Mahout: a machine learning framework designed to make it easier and faster to create intelligent applications. Mahout includes many implementations, such as clustering, classification, recommendation (collaborative filtering), frequent itemset mining, and more.

  • MapReduce: an offline computing framework for parallel processing of large-scale data sets. The parallel computation is abstracted into two functions, Map and Reduce, so parallel applications can be developed without understanding the low-level details of the distributed system. The core idea is "divide and conquer": the input data set is split into several independent chunks, which are distributed to the worker nodes and processed in parallel, and the partial results are finally merged into the final result (see the word-count sketch after this list).

  • YARN: the resource management and scheduling system. It was designed to fix the shortcomings of the original MapReduce master/slave implementation, in which a single JobTracker handled both job scheduling and resource management while many TaskTrackers executed the tasks assigned to them; that design suffered from a single point of failure, JobTracker overload, memory overflow, and unreasonable resource allocation.

  • HDFS: the Hadoop Distributed File System, an open-source implementation of Google's file system GFS. It handles large data sets with streaming access and runs on cheap commodity servers.

  • Flume: a distributed system for collecting, aggregating, and moving massive amounts of log data. It is highly available, distributed, and configurable, and is built around streaming data flows: log data is collected from many web servers and written into centralized storage such as HDFS and HBase.

  • Sqoop: Sqoop is short for SQL-to-Hadoop and is used to exchange data between Hadoop and relational databases. It can easily import data from a relational database into Hadoop, or export data from Hadoop into a relational database. Sqoop interacts with relational databases through JDBC.
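To make ZooKeeper's configuration-service scenario more concrete, here is a minimal sketch using ZooKeeper's Java client. It is only an illustration: the connection string localhost:2181, the znode path /demo-config, and the payload are placeholder assumptions, and error handling and retries are omitted.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (the address is a placeholder).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait until the session is established

        // Publish a piece of configuration as a persistent znode.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=100".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back and set a watch so the client is notified of changes.
        byte[] data = zk.getData(path, true, null);
        System.out.println("config: " + new String(data));

        zk.close();
    }
}
```

Note that a watch set this way is one-shot: after the znode changes and the event fires, the client has to re-read the data and set a new watch if it wants further notifications.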
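To show how the four HBase concepts above (row key, column family, column qualifier, timestamp) address a cell, here is a minimal sketch against the HBase Java client API. The table name user_profile, the column family info, and the qualifier city are made-up assumptions for the example, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {

            // A cell is addressed by row key + column family + column qualifier.
            Put put = new Put(Bytes.toBytes("user#0001"));   // row key
            put.addColumn(Bytes.toBytes("info"),             // column family
                          Bytes.toBytes("city"),             // column qualifier
                          Bytes.toBytes("Shanghai"));        // value
            table.put(put);

            // Read the same cell back; the newest version is returned by default.
            Result result = table.get(new Get(Bytes.toBytes("user#0001")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}
```

Because cells are versioned by timestamp, a Get returns the newest version by default; older versions, up to the configured maximum, can be requested explicitly.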
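And to illustrate MapReduce's "divide and conquer" idea, below is the classic word-count example written against Hadoop's Java MapReduce API: each mapper processes one input split and emits (word, 1) pairs, and the reducers sum the counts per word. The input and output HDFS paths are taken from the command line; this is a sketch rather than a tuned production job.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```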

A few other functional components are worth adding:

  • Storm: Storm is Twitter's open-source distributed real-time big data processing framework and stream computing platform. Its advantage is very low latency; its disadvantage is that it is not flexible enough: you must know in advance what you want to compute. It starts processing data as soon as it arrives, which makes real-time updates possible, for example counting ad clicks, and this is far more real-time than the MapReduce computing framework.

  • Spark: the MapReduce computing framework is not well suited to iterative and interactive computing. MapReduce is a disk-based computing framework, while Spark is an in-memory computing framework: it keeps data in memory to improve the efficiency of iterative and interactive applications (see the sketch after this list).

  • Pregel: Pregel is a large-scale distributed graph computing platform proposed by Google, designed specifically for large-scale graph problems in practical applications such as web link analysis and social data mining. As a distributed graph computing framework, Pregel is mainly used for graph traversal, shortest paths, PageRank computation, and so on.
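As a rough contrast with the MapReduce word count above, here is the same computation sketched against Spark's Java API (assuming a Spark 2.x-style API; the input and output paths are placeholders). The whole pipeline is expressed as transformations on in-memory RDDs, so iterative or interactive workloads avoid the repeated disk writes of chained MapReduce jobs.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]);   // e.g. an HDFS path

            // The pipeline is a chain of in-memory transformations rather than
            // separate map and reduce jobs writing to disk in between.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.saveAsTextFile(args[1]);                 // output directory
        }
    }
}
```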

Improvements to the Hadoop framework itself, from 1.0 to 2.0

Component | Problem in Hadoop 1.0 | Improvement in Hadoop 2.0
HDFS | A single name node is a single point of failure | HDFS HA provides a hot-standby mechanism for the name node
HDFS | A single namespace cannot isolate resources | HDFS Federation manages multiple namespaces
MapReduce | Inefficient resource management | A new resource management framework, YARN, was designed

The Hadoop ecosystem continues to evolve

Component | Function | Problem in Hadoop it addresses
Pig | A scripting language for processing large-scale data; the user writes a few simple statements and the system converts them into MapReduce jobs automatically | The level of abstraction is low and a lot of code has to be written by hand
Oozie | A workflow and coordination service engine that coordinates the different jobs running on Hadoop | There is no dependency management mechanism for jobs, so users have to handle dependencies between jobs themselves
Tez | A computing framework that supports DAG jobs, refactoring and recombining the operations of jobs into a larger DAG job to cut out unnecessary operations | Duplicate operations between chained MapReduce jobs reduce efficiency
Kafka | A distributed publish/subscribe messaging system; different types of distributed systems can connect to Kafka in a unified way and exchange data with the components of Hadoop in real time and efficiently | The Hadoop ecosystem lacks a unified, efficient intermediary for data exchange between its components and other products
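As a small illustration of the Kafka row above, here is a minimal producer sketch using Kafka's Java client. The broker address localhost:9092, the topic web-logs, and the message contents are placeholder assumptions; the consumers that would move the data onward into HDFS or HBase are not shown.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // Publish one message to a topic; downstream consumers (for example a job
        // that writes into HDFS or HBase) can subscribe to the same topic.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("web-logs", "host-01",
                    "GET /index.html 200"));
        }
    }
}
```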

Conclusion

This has been a simple, categorized description of some of the functional components of the Hadoop ecosystem, without going into specific implementations. Readers who are interested can take a more detailed look at the individual components, for example a closer look at HBase and some of its design strategies.