Preface:

In the previous article we asked: what is big data, and what are its characteristics? We then moved from big data to Hadoop and covered three main questions: what Hadoop is, the history of Hadoop, and what advantages Hadoop has over other big data frameworks.

Today, following on from that article, we will continue to explore the basic components of Hadoop: which technologies combine so well that Hadoop performs the way it does in the big data field today? Since Hadoop 2.0, Hadoop has mainly consisted of the following three parts:

  • MapReduce: responsible for computation
  • Yarn: responsible for resource scheduling
  • HDFS: responsible for data storage

These three technologies complement each other. Of course, this article is only a first look at the role each of them plays in Hadoop; MapReduce and HDFS will be covered in much more detail in later sections.

The MapReduce programming model:

I think everyone already has a feel for what Map means (after all, we use it all the time in Java), and Reduce means to reduce, to boil things down. So MapReduce is a process of splitting work up (Map) and then combining the results (Reduce).

Let’s look at the definition:

MapReduce is a programming framework for distributed computing programs and the core framework for developing data analysis applications on Hadoop. Its core function is to combine user-written business logic code with the framework's built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

MapReduce can be summarized as a map phase and a reduce phase.

The definition alone is a bit abstract, so what is MapReduce in plain terms? Let's go back to the example from our last post:

Back in junior high, the boys all liked fantasy novels. For fear of being caught by the dean of students, we came up with a "distributed storage" scheme: the novel was split into several chunks of pages, each kept by a different classmate. But the dean was no fool; however clever the scheme, the crackdown was cleverer, and in the end we were caught. Word came down that if the book was not pieced back together and handed in to his office that very day, everyone's parents would be called in.

In the end, everyone handed their surviving pages to the class monitor, Xiaoming, who sorted them by page number and delivered the reassembled book to the dean.

The dean said: don't you have anything better to do? Instead of studying you read this stuff all day. Is the English textbook not worth memorizing, or is the math book not interesting enough? Since you have so much energy to spare, take this character, Xiao Yan, and go count for me how many times his name appears in the whole book! No dinner until you're done!

Xiaoming thought: this is hopeless. If I count it all by myself, who knows when I'll ever finish.

Here's the thing: in the traditional programming model, if you want to know how often a single word appears in a book, you write a program that traverses the entire file. For a small file that is fine, but for a book the size of an entire fantasy saga, a single pass takes longer than anyone wants to wait.
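To make the contrast concrete, here is a minimal single-machine sketch of that traditional approach in Java (the file name book.txt and the target word are placeholders for illustration): one thread streams the whole file and counts one word.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SingleMachineCount {
    public static void main(String[] args) throws IOException {
        String targetWord = "XiaoYan";   // hypothetical word to count
        long count = 0;
        // Read the whole book line by line on one machine, one thread.
        try (BufferedReader reader = new BufferedReader(new FileReader("book.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.split("\\s+")) {
                    if (word.equals(targetWord)) {
                        count++;
                    }
                }
            }
        }
        System.out.println(targetWord + " appears " + count + " times");
    }
}
```

This works, but everything sits on one reader, one disk, and one CPU core, so the running time grows with the size of the book.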

Isn’t there multithreading?

There is multithreading, but it only helps if we have a multi-core or multi-processor machine, and multithreaded programs are a bit more complicated to write.

But Xiaoming was no fool. He thought: everyone read the book, so why should I be the only one counting? So he came up with a plan. Back in class he shared the joy with everyone: the dean wants to know how many times Xiao Yan's name appears in the book, and I can't possibly finish counting by myself before tomorrow. So whoever held which pages counts the occurrences on those pages, writes the number at the bottom, and hands it to me.

So each boy in the class counted a few dozen pages, the whole count was finished in less than an hour, and Xiaoming made it safely through the disaster.

This is MapReduce: I can't do the job alone, so I find ten people to do it in parallel (Map), and at the end I summarize their results (Reduce).
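To show what "everyone counts their own pages and hands the totals to Xiaoming" looks like in code, here is a hedged sketch of the classic WordCount job written against the Hadoop MapReduce Java API. It is an illustrative skeleton under the usual assumptions (input and output paths passed on the command line), not the exact code of any later article.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: each "classmate" counts the words on their own pages (input split).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    // Reduce: "Xiaoming" adds up the partial counts for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, a job like this would typically be submitted with something along the lines of `hadoop jar wordcount.jar WordCount /input /output`, and the framework takes care of splitting the input and running the mappers in parallel.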

Of course, MapReduce is not quite as simple as the story above, and the concrete implementation details are a bit more involved. The detailed process and principles will be analyzed in depth in the MapReduce topic.

Yarn:

Yarn was born in the Hadoop 2.x era. Back in the distant Hadoop 1.x days, MapReduce was responsible not only for computation but also for resource scheduling; it played both father and mother. One component doing the work of two cannot hold up forever. Since MapReduce handled both computing and resource scheduling, if it ever "quit", the entire Hadoop computing and resource scheduling system would stop working.

The coupling was simply too tight, and Hadoop could not afford to be held hostage by MapReduce. So Hadoop "hired" a dedicated resource scheduler called Yarn: MapReduce is now only responsible for computation, and Yarn is only responsible for resource scheduling. One component going on strike can no longer paralyze the whole system.

Yarn is made up of four main components:

ResourceManager (RM):

  • Processes client requests.
  • Monitors the NodeManagers.
  • Starts and monitors the ApplicationMasters.
  • Allocates and schedules resources.

NodeManager (NM):

  • Manages the resources on a single node.
  • Handles commands from the ResourceManager.
  • Handles commands from the ApplicationMaster.

ApplicationMaster (AM):

  • Responsible for splitting the input data.
  • Applies for resources for the application and assigns them to its internal tasks.
  • Handles task monitoring and fault tolerance.

Container:

A Container is the resource abstraction in Yarn. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network.
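To tie these components back to everyday code: when a MapReduce job is submitted, each map or reduce task runs inside a Yarn Container, and the size of those containers can be influenced from the job configuration. Below is a minimal, hedged sketch; the memory and vcore numbers are made-up example values, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnResourceHint {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask Yarn for containers of a given size for each task
        // (example values only; tune them to your cluster).
        conf.setInt("mapreduce.map.memory.mb", 2048);    // memory per map container
        conf.setInt("mapreduce.map.cpu.vcores", 1);      // vcores per map container
        conf.setInt("mapreduce.reduce.memory.mb", 4096); // memory per reduce container
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);   // vcores per reduce container

        Job job = Job.getInstance(conf, "resource-hint-demo");
        // ... set mapper/reducer/input/output as usual, then submit:
        // job.waitForCompletion(true);
    }
}
```

The ResourceManager decides, based on such requests and the capacity reported by the NodeManagers, where the containers are launched, while the job's ApplicationMaster handles the per-task negotiation.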

HDFS:

HDFS (Hadoop Distributed File System) is Hadoop's distributed file system.

Compared with MapReduce and Yarn, HDFS is easier to understand. The HDFS architecture is divided into three parts:

NameNode (NN):

Stores file metadata, such as file names, the directory structure, file attributes (creation time, number of replicas, permissions), the block list of each file, and the DataNodes on which each block resides.

The NameNode mainly stores file metadata. To borrow a library analogy: the NameNode is the catalogue that records the titles, authors, attributes, and shelf locations of all the books in the library.

DataNode (DN):

Stores the actual file block data, along with the checksums of those blocks, in the local file system.

If the NameNode is the library catalogue, then the DataNodes are the bookshelves. That is, our actual data is stored on the DataNodes.

Secondary NameNode (2NN):

A secondary background process that monitors the HDFS status and takes snapshots of the HDFS metadata at regular intervals.

Note that the Secondary NameNode (2NN) is not a hot standby for the NameNode; it only acts as an assistant and a secondary backup.

When the NameNode fails, the metadata snapshots kept by the Secondary NameNode (2NN) can be used to help restore the NameNode.
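To make the catalogue-and-bookshelves analogy tangible, here is a hedged sketch of a client talking to HDFS through the Java FileSystem API. The NameNode address hdfs://localhost:9000 and the paths are placeholders; adjust them to your own cluster.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client asks the NameNode (the "catalogue") where things are;
        // the actual bytes are read from / written to DataNodes (the "shelves").
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Upload a local file into HDFS.
        fs.copyFromLocalFile(new Path("book.txt"), new Path("/books/book.txt"));

        // List what is stored under /books.
        for (FileStatus status : fs.listStatus(new Path("/books"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}
```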

Together, these three make up HDFS. Of course, there will be a dedicated HDFS section that explains all of this in more detail.

Finally, a quick technical summary:

MapReduce, Yarn, and the HDFS file system are responsible for distributed computing, resource scheduling, and distributed storage respectively, and none of them is dispensable. It is the seamless cooperation of these three technologies that has given Hadoop its standing in the big data field today. Even if you do not yet know how Yarn coordinates resources or how MapReduce computes in parallel, I believe you now have a basic picture of Hadoop. In the next article we will walk step by step through configuring a virtual machine and then setting up a pseudo-distributed Hadoop environment.

Thank you very much for reading this far; your support and attention are my motivation to keep sharing high-quality content.

Relevant code has been uploaded to my Github. Make sure you hit “STAR” ahhhhhhh

Through a thousand mountains and rivers there is always affection; how about giving it a star?

Hanshu development notes

Feel free to like and follow me; there will be treats (just kidding).