In 2011, Baidu saw only a handful of Hadoop-related searches each day. By 2015 the count had grown to more than 8 million, and it has since passed hundreds of millions: Hadoop has become essential infrastructure for big data. Hadoop is widely recognized as the industry-standard open source software stack for big data, providing massive data processing capability in distributed environments. Almost all major vendors build development tools, open source software, commercial tools, and technical services around Hadoop, and large IT companies such as EMC, Microsoft, Intel, Teradata, and Cisco have significantly increased their Hadoop investments in recent years. So what exactly is Hadoop? What does it do? What does its architecture look like? Today I'll take a quick look at some of the basic concepts of Hadoop.

What is Hadoop?

Hadoop is a distributed system infrastructure developed by the Apache Foundation: a software framework that combines a storage system with a computing framework. It mainly solves the problems of storing and computing over massive data, and it is the cornerstone of big data technology. Hadoop processes data in a reliable, efficient, and scalable way, and it lets users develop distributed programs without understanding the underlying distributed details, so applications that deal with massive data can be easily developed and run on Hadoop.

What problems can Hadoop solve?

1. Massive data storage

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to data, which makes it suitable for applications with large data sets. A cluster consists of N machines running DataNode processes and one machine (plus a standby) running the NameNode process. Each DataNode manages part of the data, while the NameNode manages the metadata of the entire HDFS cluster.
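To make this concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API. The NameNode address and file path are placeholders for illustration; adjust them to your cluster:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address -- replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hello.txt");

            // Write: the client asks the NameNode where to place blocks,
            // then streams the data to DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: block locations again come from the NameNode's metadata.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```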

2. Resource management, scheduling and allocation

Apache Hadoop YARN (Yet Another Resource Negotiator) is the resource manager introduced in Hadoop 2.x. It is a general-purpose resource management system and scheduling platform that provides unified resource management and scheduling for upper-layer applications. Its introduction brought great benefits to clusters in terms of utilization, unified resource management, and data sharing.
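As a small illustration of YARN as a unified resource layer, here is a sketch that uses the YarnClient API to list every running NodeManager and its resources. The ResourceManager address is an assumption; normally it is read from yarn-site.xml on the classpath:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        // Placeholder ResourceManager address -- usually picked up from yarn-site.xml.
        conf.set(YarnConfiguration.RM_ADDRESS, "resourcemanager-host:8032");

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        try {
            // Ask the ResourceManager for every running NodeManager and its resources.
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.printf("%s total=%s used=%s%n",
                        node.getNodeId(), node.getCapability(), node.getUsed());
            }
        } finally {
            yarnClient.stop();
        }
    }
}
```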

What is the Hadoop component architecture?

Having covered the basics of Hadoop, let's look at the core architecture and principles of HDFS and YARN, starting with the HDFS framework diagram:

After looking at the picture above, consider a few questions:

1. What is metadata? How does the NameNode maintain it, and how is its consistency ensured?

The NameNode maintains metadata about the HDFS cluster, including the directory tree of files, the list of data blocks belonging to each file, permission settings, and the number of replicas. This metadata is kept in memory, so what happens if the NameNode fails unexpectedly? A metadata modification involves two parts: the in-memory data is updated, and the operation is also written to an EditLog on disk. Let's look at FsImage and EditLog:

FsImage: a mirror file of the metadata in NameNode memory. It is a permanent checkpoint of the metadata and contains the serialized information of all HDFS directories and file inodes.

EditLog: a log of all operations performed on HDFS since the last checkpoint, such as adding files, renaming files, and deleting directories.

The EditLog is similar to a bank account statement: the stream of entries can grow very large. If the EditLog becomes very large, the NameNode has to replay the whole log to recover its metadata after an outage, which is a very slow process. This is where the Standby NameNode comes in: it pulls EditLogs from the JournalNode set, periodically merges them into the FsImage (the FsImage being the merged snapshot of the data), and uploads the new FsImage to the Active NameNode.
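To see why the EditLog plus FsImage split works, here is a toy, self-contained sketch of the same write-ahead idea. This is an illustration of the pattern only, not Hadoop's actual implementation: every mutation is appended to a log before the in-memory state changes, and a checkpoint compacts the log so recovery stays fast.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the EditLog/FsImage pattern -- not Hadoop's actual code.
public class ToyNameNode {
    private final Map<String, String> namespace = new HashMap<>(); // in-memory metadata
    private final Path editLog;

    ToyNameNode(Path editLog) {
        this.editLog = editLog;
    }

    // The mutation is logged durably BEFORE the in-memory state is updated,
    // so a crash can always be recovered by replaying the log.
    void createFile(String path, String blockList) throws IOException {
        Files.writeString(editLog, "CREATE\t" + path + "\t" + blockList + "\n",
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        namespace.put(path, blockList);
    }

    // Checkpointing (the Standby NameNode's job): serialize the in-memory
    // state to an "fsimage" and truncate the log so replay stays short.
    void checkpoint(Path fsImage) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(fsImage, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, String> e : namespace.entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue() + "\n");
            }
        }
        Files.writeString(editLog, "", StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
    }
}
```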

2. How do the Active and Standby NameNodes switch over so that there is always an Active node?

As the HDFS diagram above shows, each ZKFC monitors the health status of its NameNode. ZKFC uses the primary/standby election provided by ZooKeeper to decide when to switch over, then notifies the NameNode and changes its state.
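ZKFC's election logic is internal to Hadoop, but the underlying mechanism can be illustrated with a minimal ZooKeeper leader-election sketch using Apache Curator. The connect string and latch path below are assumptions for illustration: whichever process holds the latch plays the "Active" role, and if it dies another contender takes over automatically.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Minimal sketch of ZooKeeper-based leader election -- the same idea ZKFC
// relies on to guarantee exactly one Active NameNode. Not ZKFC's actual code.
public class FailoverDemo {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3)); // placeholder address
        zk.start();

        // All contenders register under the same path via ephemeral nodes;
        // ZooKeeper grants leadership to exactly one of them at a time.
        try (LeaderLatch latch = new LeaderLatch(zk, "/demo/namenode-election")) {
            latch.start();
            latch.await(); // blocks until this process is elected "Active"
            System.out.println("I am the Active node now");
            // ... serve as Active; if this process dies, its ephemeral node
            // disappears and a standby contender is elected automatically.
        } finally {
            zk.close();
        }
    }
}
```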

YARN framework diagram:

The preceding figure describes how a YARN task is submitted and how resources are allocated to it. The following components are involved in the process:

ResourceManager: monitors, allocates, and manages all cluster resources; processes client requests; starts and monitors the AppMaster and NodeManagers.

NodeManager: handles resource and task management on a single node, and carries out commands from the ResourceManager and the AppMaster.

AppMaster: responsible for the scheduling and coordination of one particular application; applies for the application's resources and monitors its tasks.

Container: YARN's unit of dynamic resource allocation, holding a certain amount of memory and a number of cores.

The overall process of submitting a task is as follows:

(1) The client submits an application to YARN, including the ApplicationMaster, the command to launch it, the user program, and resources.

(2) ResourceManager allocates the first Container for the application and communicates with the corresponding NodeManager, asking it to start the application's ApplicationMaster in this Container.

(3) ApplicationMaster registers with ResourceManager, so that users can view the application's running status through ResourceManager. ApplicationMaster then applies for resources for each task.

(4) ApplicationMaster applies for and obtains resources from ResourceManager via an RPC protocol, by polling.

(5) Once ApplicationMaster has obtained a resource, it communicates with the corresponding NodeManager, asking it to start the task.

(6) NodeManager sets up the running environment for the task (environment variables, JAR packages, binaries, and so on), writes the task startup command to a script, and starts the task by running that script.

(7) Each task reports its status and progress to ApplicationMaster through an RPC protocol, so that ApplicationMaster can track each task and restart it if it fails. While the application is running, the user can query its current state from ApplicationMaster via RPC at any time.

(8) After the application finishes, ApplicationMaster deregisters itself from ResourceManager and shuts itself down.
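As a concrete illustration of step (1), here is a minimal sketch that submits an application through the YarnClient API, loosely modeled on YARN's distributed-shell example. The "ApplicationMaster" here is just a placeholder shell command, and the memory/vcore sizes are arbitrary assumptions; a real AM would go on to register with the ResourceManager and request task containers:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step (1): ask the ResourceManager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app");

        // Describe the container that will run the ApplicationMaster.
        // This placeholder command only illustrates the submission flow.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "echo hello-from-am 1>/tmp/am.stdout 2>/tmp/am.stderr"));
        appContext.setAMContainerSpec(amContainer);

        // Resources for the AM container (arbitrary sizes for illustration).
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemorySize(256);
        capability.setVirtualCores(1);
        appContext.setResource(capability);

        // Steps (2) onward happen inside YARN once we submit.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
        yarnClient.stop();
    }
}
```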

From the above, you should now have a snapshot of Hadoop's basic frameworks. For a deeper understanding, refer to the structure diagrams above together with the official Hadoop website and community.

This article was published in: Stack Research Club

We also have an open source project on GitHub called Flinkx; please check it out.