Fan Donglai, Spark Contributor. This article is excerpted from “Spark-on-Learning-as-you-Go Talk 44”
Hi, I am Fan Donglai, your Spark teacher. In this class, I will introduce you to the cluster operating system: Hadoop.

The emergence of Hadoop was undoubtedly a blessing for users who had data but could not analyze it. In addition, with the rise of the mobile Internet during that period, data volumes grew geometrically, and Hadoop largely solved the pain point of processing them. For a long time, Hadoop was the de facto standard for big data processing, and even now many companies build their big data processing architectures around it.

Based on this, this class mainly discusses the following questions:

  • Hadoop 1.0
  • Hadoop 2.0
  • Hadoop ecosystem and distributions

Hadoop 1.0
Since its inception, Hadoop has gone through three major versions: 1.0, 2.0, and the latest, 3.0. The most representative versions are 1.0 and 2.0; 3.0 changes little compared with 2.0. Hadoop 1.0 has a relatively simple architecture, implemented largely according to the framework described in the original papers, as shown in the following figure:





The lower layer is HDFS (Hadoop Distributed File System), an open-source implementation of GFS, and the upper layer is MapReduce, a distributed computing framework built on top of that distributed file system. The layering looks reasonable, but using this architecture in practice exposed three main problems:

  1. The master node has poor reliability: it is a single point of failure with no hot standby.
  2. When too many MapReduce jobs are submitted, scheduling becomes the bottleneck of distributed computing.
  3. Resource utilization is low and other types of distributed computing frameworks are not supported.
Point 1 is a relatively minor issue concerning the availability of the system, but points 2 and 3 are more critical.

The second problem is that MapReduce, the distributed computing framework of Hadoop 1.0, does not separate resource management from job scheduling. As a result, when many jobs are submitted at the same time, the single scheduler is overwhelmed, which in turn keeps resource utilization low. The third problem is that heterogeneous computing frameworks are not supported. What does this mean? Spark already existed, but if a cluster was running Hadoop 1.0, you had to deploy a second cluster to run Spark jobs. That wasted resources and made little sense, but nothing could be done about it: it was a legacy of the original paper's design.
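To make the MapReduce paradigm from the discussion above concrete, here is a toy sketch of a word-count job in plain Python. It mimics the map, shuffle, and reduce phases in a single process; it does not use any Hadoop API, and all function names are illustrative, not part of Hadoop:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: each mapper emits (word, 1) pairs from its input split
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: the framework groups all values by key between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: each reducer aggregates the grouped values per key
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big compute", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'compute': 1}
```

In real Hadoop the map and reduce phases run as parallel tasks across the cluster, with HDFS holding the input splits and the shuffle moving data over the network; the data flow, however, is exactly this three-step pipeline.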


Hadoop 2.0
To address these problems, the community began developing Hadoop 2.0. The biggest change in Hadoop 2.0 is the introduction of YARN, a resource management and scheduling system, in place of the original monolithic computing framework; computing frameworks now run as clients of YARN, as shown in the following figure:





YARN abstracts all the computing resources in a cluster into a resource pool with two dimensions: CPU and memory. YARN also sits on top of HDFS: YARN manages computing resources while HDFS manages storage resources. In addition, YARN adopts a two-layer scheduling design, which greatly reduces the scheduler's burden; I will explain it in detail in subsequent lessons.
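As a rough illustration of the two-dimensional resource pool described above, here is a minimal Python sketch. The class and method names are hypothetical, not YARN's actual API; real YARN grants resources as containers through the ResourceManager, but the bookkeeping idea is the same:

```python
class ResourcePool:
    """Toy model of a cluster resource pool with two dimensions:
    virtual cores and memory (in MB)."""

    def __init__(self, vcores, memory_mb):
        self.vcores = vcores
        self.memory_mb = memory_mb

    def allocate(self, vcores, memory_mb):
        # Grant a container only if BOTH dimensions have enough headroom
        if vcores <= self.vcores and memory_mb <= self.memory_mb:
            self.vcores -= vcores
            self.memory_mb -= memory_mb
            return True
        return False

pool = ResourcePool(vcores=16, memory_mb=65536)
print(pool.allocate(4, 8192))   # True: container granted
print(pool.allocate(32, 1024))  # False: not enough vcores left
```

The key point is that an allocation must fit in both dimensions at once, which is why a cluster can run out of memory while CPUs sit idle (or vice versa); YARN's schedulers exist to manage exactly this trade-off.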

Hadoop 2.0 fixes the major defects of Hadoop 1.0. In addition, YARN is compatible with multiple computing frameworks, such as Spark, Storm, and MapReduce, and HDFS has become the underlying storage of many systems. Hadoop takes an inclusive approach to big data open-source components and gradually formed a large ecosystem, as shown in the figure below (which shows only some of the components). Back then, if you wanted to build a big data platform, you could not bypass Hadoop.



Hadoop ecosystem and distributions


The Hadoop ecosystem is built around Hadoop's core components, such as HDFS and YARN. The computing layer offers many options, such as the SQL-capable Hive and Impala, as well as Pig, Spark, and Storm. There are also utility components, such as Sqoop for bulk data extraction, Flume for streaming data transfer, and ZooKeeper for distributed coordination, plus operations components, such as Ambari for deployment and Ganglia for cluster monitoring. These components may seem cumbersome, but they are all necessary in a production environment, so at the time integrating so many components into one platform raised all sorts of problems.

This gave rise to companies such as Cloudera and Hortonworks, whose core product was packaging the most commonly used open-source components of the Hadoop ecosystem into a Hadoop distribution: Cloudera's is called CDH and Hortonworks' is called HDP. All components in these distributions are free to use and spare users headaches such as compatibility issues; both companies also offered paid enterprise versions aimed at less technologically sophisticated companies.

In Hadoop's heyday, almost every company's big data platform used one of these two distributions. Cloudera was recognized by the capital markets and was once valued at $5 billion. However, with the decline of Hadoop, Cloudera's stock price kept falling after it went public, and it eventually merged with Hortonworks, also a public company, at a combined market value of just about $2 billion. It is worth noting that Doug Cutting, the father of Hadoop, also works at Cloudera.



Conclusion


This lesson mainly covered the Hadoop architecture and some of its key components and concepts. As a cluster operating system, Hadoop will not leave the big data stage in the short term, and it will be hard to displace even in the longer run.

That’s all for this class. Next class I will explain “Unified resource management and scheduling systems: design and implementation”. Remember to come to class on time!




Copyright notice: The copyright of this article belongs to Lagou Education and the columnist. No media, website, or individual may reproduce, link to, repost, or otherwise copy and publish it without authorization; violators will be held accountable.