Hadoop basics

Hadoop consists of two parts:

1. Hadoop Distributed File System (HDFS)

HDFS has high fault tolerance and can be deployed on low-cost hardware. It is suited to applications with large data sets and provides high-throughput reads and writes. HDFS uses a master/slave architecture: generally, a single Namenode runs on the master node, and one Datanode runs on each slave node.

HDFS supports a traditional hierarchical file organization, and its operations are similar to those of existing file systems. For example, you can create and delete a file, move a file from one directory to another, rename a file, and so on. The Namenode manages the file-system namespace: operations such as creating and deleting files and directories are controlled by the Namenode.
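The division of labor described above can be sketched in a toy, in-memory model: the Namenode tracks only namespace metadata (paths and block lists), not file contents. This is an illustration under that assumption, not HDFS code; real clients issue these operations through the Hadoop FileSystem API or the `hdfs dfs` shell.

```python
# Toy sketch of the namespace operations a Namenode manages.
# Paths map to metadata only; actual data blocks live on Datanodes.
namespace = {}  # path -> metadata (placeholder dict)

def create(path):
    # Namenode records new file metadata; no data is stored here.
    namespace[path] = {"blocks": []}

def rename(src, dst):
    # Renaming is a pure metadata operation on the Namenode.
    namespace[dst] = namespace.pop(src)

def delete(path):
    # Deleting removes the namespace entry.
    del namespace[path]

create("/user/logs/a.txt")
rename("/user/logs/a.txt", "/user/archive/a.txt")
delete_target = sorted(namespace)
print(delete_target)  # ['/user/archive/a.txt']
```

Because every one of these operations touches only the metadata table, the Namenode can serve them quickly even when the files themselves are very large.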

2. Implementation of MapReduce

MapReduce, one of Google's key technologies, is a programming model for computing over large amounts of data. Such data is usually processed with parallel computing, yet parallel programming remains out of reach for many developers. MapReduce simplifies it: developers with little parallel-computing experience can use the model to build parallel applications.
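To make the model concrete, here is a minimal single-process sketch of the classic MapReduce word count (map, then shuffle/group, then reduce). Hadoop runs these phases in parallel across a cluster; the function names below are illustrative, not the Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

The developer writes only the map and reduce functions; the framework handles the shuffle, the parallelism, and the fault tolerance.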

MapReduce takes its name from the model's two core operations: Map and Reduce. Put simply, a Map is a one-to-one transformation of one set of data into another, with the mapping rule specified by a function. For example, mapping [1, 2, 3, 4] with "multiply by two" yields [2, 4, 6, 8]. Reduce collapses a set of data into a single result, with the reduction rule likewise specified by a function. For example, reducing [1, 2, 3, 4] by summation gives 10, while reducing by product gives 24.
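The examples above correspond directly to Python's built-in `map()` and `functools.reduce()`, which makes for a compact illustration of the two operations:

```python
from functools import reduce

data = [1, 2, 3, 4]

# Map: one-to-one transformation, rule given by a function.
doubled = list(map(lambda x: x * 2, data))   # [2, 4, 6, 8]

# Reduce: collapse the set to one value, rule given by a function.
total = reduce(lambda a, b: a + b, data)     # 10
product = reduce(lambda a, b: a * b, data)   # 24

print(doubled, total, product)
```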