
Background

The term “big data” is everywhere, and we are living in the century of data. The volume of data has exploded, and the chart below, which forecasts the volume of data over the next few years, shows how fast it’s growing.

As data increases, the storage capacity of hard disks also increases. However, access speed (the rate at which data is read and written) has not kept up, which means that we spend a lot of time reading and writing data. The gap between data transfer rates and capacity growth on hard drives, as shown in the chart below, is clearly getting wider.

Currently, with a terabyte of data on a hard drive, it takes about 2.5 hours just to read it all. In the era of big data, slow read and write rates have therefore become the bottleneck for data storage, retrieval, and analysis. How can we deal with this bottleneck? One idea that has been floated is to use multiple hard disks and distributed computing.
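
(As a rough sanity check, assuming a sustained transfer rate of around 100 MB/s for a single disk, a figure the post does not state: 1 TB is about 1,000,000 MB, so a full scan takes roughly 1,000,000 / 100 = 10,000 seconds, a little under three hours, which is in line with the estimate above.)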

The figure below illustrates a rough model of using multiple hard drives. The data is divided into several parts, and each part is stored on a separate drive. When we want to analyze the data, each drive reads and processes its own portion in parallel, and the partial results are then combined into a final answer. Each disk therefore only has to read part of the data, which shortens the overall read and write time.
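
The idea can be illustrated with a toy sketch (my own, not from the post): each partition of the data is scanned independently, and the partial results are combined at the end. Here the partitions are just in-memory arrays and the "analysis" is finding the overall maximum.

```java
import java.util.Arrays;

public class ParallelScan {
    public static void main(String[] args) {
        // One array per "disk"; in reality each partition would live on its own drive.
        int[][] partitions = { {3, 17, 5}, {42, 8}, {26, 31, 4} };

        int max = Arrays.stream(partitions)
                .parallel()                                        // scan partitions independently
                .mapToInt(p -> Arrays.stream(p).max().getAsInt())  // partial result per partition
                .max()                                             // combine the partial results
                .getAsInt();

        System.out.println("overall max = " + max);
    }
}
```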

However, this raises a question: if multiple drives are used to store only one data set, isn’t that a waste of disk space?

It is, if there is only one data set. But if we have multiple data sets (or multiple users on the system), we can store all of them this way and use the disk space far more efficiently. And as long as we do not analyze every data set at the same time, read and write times are still reduced.

We might then want to build such a system, but several problems stand in the way. The first is the hardware failure rate: although a single hard disk fails only rarely, the chance of at least one failure in a large group of disks is much higher, and one failed disk is enough to break the system. The second is how to combine the data read from multiple drives. And there are many other issues that need to be addressed.

This is what the Hadoop project provides: the concepts, ideas, and architecture for distributed computing, somewhat like an operating system designed for a cluster, together with libraries that let users write their own applications to interact with the system and analyze their own data sets.

Hadoop overview

Hadoop is a top-level project made up of several sub-projects, each addressing one aspect of distributed computing. Some of the Hadoop sub-projects are listed below (not a complete list):

  • HDFS: a distributed file system
  • MapReduce: a distributed data processing model and execution environment
  • Pig: a data flow language and execution environment for exploring very large data sets
  • HBase: a distributed, column-oriented database
  • ZooKeeper: a distributed, highly available coordination service
  • Hive: a distributed data warehouse

Below is a graphical representation of the Hadoop subprojects:

The most important subprojects are HDFS and MapReduce, which handle data storage and data processing respectively and provide systematic solutions to the problems described above.

Hadoop components

HDFS

HDFS, short for Hadoop Distributed File System, manages storage across a network of machines and is well suited to:

  • Very large files (gigabytes to terabytes in size)
  • Streaming data access (write once, read many times)
  • Commodity hardware (it does not require expensive or specialized hardware)

It is not suited to:

  • Low-latency data access (HDFS spends time looking up file locations)
  • Large numbers of small files (e.g. billions of small files)
  • Multiple writers or arbitrary file modifications

These limitations follow from the way HDFS works internally. The architecture of HDFS is shown below:

Files in HDFS are split into blocks; the block size is the minimum amount of data that the system reads or writes at a time. Blocks from the same file are stored independently, and since, as mentioned earlier, a group of hard disks cannot be fully trusted, each block is replicated to guard against hardware failure.

A datanode is a worker machine that stores and retrieves file blocks and periodically reports the list of blocks it holds to the namenode. A datanode can be thought of as one computer on the network.

The namenode is the core and administrator of the system: it maintains the filesystem tree and the metadata for all the files and directories in that tree. The tree structure looks like this:

At the top is a root directory containing multiple subdirectories; each subdirectory contains multiple files, and each file is divided into multiple blocks. Because every datanode periodically reports its block list, the namenode knows which datanodes hold each block.

The following figure shows how data is read from HDFS:

First, the client obtains the list of blocks and their locations from the namenode. The input stream then reads the blocks directly from the datanodes that store them.
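
As an illustration (a sketch of my own, not code from the post), reading a file through the HDFS Java API looks roughly like this; the namenode address and the path /data/input.txt are placeholders:

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client contacts the namenode (address is a placeholder) to learn
        // which datanodes hold the file's blocks, then streams the blocks from them.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        try (InputStream in = fs.open(new Path("/data/input.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false); // copy file contents to stdout
        }
    }
}
```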

The HDFS data writing process is as follows:

First, the client asks the namenode whether the file already exists in the system. If it does not, the namenode creates the file in the filesystem tree and returns a list of datanodes that can store its blocks. The output stream then writes each block to a pipeline of datanodes: because every block must be replicated, it is written to the first datanode, which forwards it to the second, which in turn forwards it to the third (the number of replicas is configurable by the user).
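
A corresponding sketch of the write path (again with assumed paths and an illustrative replication factor, not the post’s code):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // ask for 3 copies of each block
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // The namenode records the new file and chooses the datanode pipeline;
        // the bytes written here flow through that pipeline, replica by replica.
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            out.writeUTF("hello hdfs");
        }
    }
}
```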

MapReduce

MapReduce is a parallel data processing model for distributed computing. The work is divided into two phases, a map phase and a reduce phase. Each phase has input and output, and the data in both cases takes the form of (key, value) pairs, as shown below:

Here’s an example: we want to find the highest recorded temperature for each year, and the input is a large file in which every line holds various pieces of information about one particular day. In the map phase we extract the year and the temperature from each line; since each line is one day’s data, the map output contains many records per year. The map output is then grouped by key, a step called the shuffle. Finally, the reduce function computes the maximum temperature for each year, as shown in the figure below.
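
As a rough sketch of what the two functions might look like with Hadoop’s Java API (my own illustration, assuming each input line is simply "year<TAB>temperature"; the post does not give the exact record format):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (year, temperature) for each line of the input split.
        String[] fields = line.toString().split("\t");
        context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
    }
}

class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        // After the shuffle, all temperatures for one year arrive together;
        // keep only the maximum.
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temps) {
            max = Math.max(max, t.get());
        }
        context.write(year, new IntWritable(max));
    }
}
```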

So how does MapReduce apply to distributed computing? The unit of work to be done is called a job. A job is divided into multiple tasks, namely map tasks and reduce tasks. The input data set is split into parts of roughly equal size (much as HDFS splits files into blocks), and each task processes one part. A jobtracker and tasktrackers coordinate the work, and the parallel computation proceeds as shown below.
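
To make the job concept concrete, here is a hypothetical driver (continuing the sketched MaxTemperature classes above; class names and paths are my own) that wires the mapper and reducer into a job and submits it:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperatureJob.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);                               // number of reduce tasks is a user setting
        FileInputFormat.addInputPath(job, new Path(args[0]));   // one map task per input split
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```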

For each split of the input data set we apply the map function, and the combined map outputs become the input of the reduce function, which produces the result. Note that the map tasks can all run at the same time, whereas the reduce phase always comes after the map phase. If there are multiple reduce functions, the data flow looks like this:

Hadoop does its best to run each map task on the node where its input split is already stored in HDFS; this is known as the data locality optimization. In other words, the machine that holds an input split is the one that runs the map task for it. If another machine were used, the input split would have to be transmitted over the network, consuming bandwidth, so data locality reduces the bandwidth cost of moving data around. Reduce tasks, however, cannot benefit from data locality: the output of every map function must be transferred to the specific nodes that run the reduce tasks. To minimize this transfer, a combiner function can be used between map and reduce to strip unnecessary information from the map output. Continuing with the maximum-temperature example, a combiner can compute the local maximum annual temperature within each partition of the input data set during the shuffle phase.
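
Because taking a maximum locally first does not change the final answer, the reducer sketched earlier could also serve as the combiner; in the hypothetical driver above, that is a single extra line:

```java
// Reuse the reducer as a combiner to pre-aggregate each map task's output
// (continuing the assumed MaxTemperature classes; a sketch, not the post's code).
job.setCombinerClass(MaxTemperatureReducer.class);
```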

The following figure shows how a MapReduce job runs. A unique ID is assigned to the job, and the job resources (such as configuration information) are fetched from the file system. After the job is submitted to the jobtracker, it is initialized: the jobtracker retrieves the input splits from the file system and creates one map task per split, while the number of reduce tasks depends on user settings. After receiving their tasks, the tasktrackers periodically report task status back to the jobtracker.

There is a newer system called YARN that provides similar functionality for scheduling and running jobs, but I won’t discuss it here because I’m not going into that subproject.

MapReduce operates at a fairly low level: using it alone, we have to manage the details of the data flow ourselves. We therefore need higher-level tools that let us focus on the data itself rather than on the technical details of processing it. The next section discusses three higher-level subprojects.

Other sub-projects

Pig

Pig is a language and execution environment for processing large data sets. Pig has two parts: Pig Latin, a language for expressing data flows, and a runtime environment that runs Pig Latin programs.

Pig Latin programs consist of a series of operations or transformations applied to the input data, which the Pig runtime environment converts into a series of MapReduce jobs, so that Pig Latin programs can be understood by the Hadoop system. Continuing with the maximum-temperature example, Pig can express it in just a few lines.

HBase

HBase is a column-oriented distributed database built on top of HDFS. Although it is a database, it does not support SQL. The idea is that when a table grows too large, it is automatically split horizontally into multiple regions, each containing a range of rows. Like HDFS and MapReduce, HBase has a master-slave architecture: a master node manages the namespace and assigns regions, while each RegionServer serves one or more regions.
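
To give a feel for what “column-oriented, no SQL” means in practice, here is a minimal sketch of the HBase Java client API (the table, column family, and values are assumptions of mine, not from the post; it presumes a table "temperatures" with column family "d" already exists):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("temperatures"))) {
            // Write one cell: row key "1990", column family "d", qualifier "max".
            Put put = new Put(Bytes.toBytes("1990"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("max"), Bytes.toBytes("38"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("1990")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("max"))));
        }
    }
}
```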

Hive

Hive is a data warehouse infrastructure built on Hadoop. It supports the Hive Query Language (HiveQL), which is very similar to SQL, so anyone familiar with databases can learn Hive quickly.