This article explains some of the basic concepts of Hadoop, examines the limitations of MapReduce, and from there arrives at the core ideas of Spark.

This article was first published on the WeChat public account [Data-oriented Programming]

Hadoop was born as a big data technology. After years of development, it is no longer a single technology but a complete big data ecosystem.

Hadoop is, by nature, a distributed system. Because a single machine cannot store and process big data on its own, the data must be spread across different machines, while users access and operate on it as if it all lived on a single machine. To achieve this, Hadoop introduced two core concepts: HDFS and MapReduce.

HDFS

HDFS (Hadoop Distributed File System) is a distributed data storage scheme: a large data set is stored in a cluster composed of multiple machines, with each machine holding part of the data.

Suppose we have a data set to store. The HDFS cluster contains several storage nodes, Data Node 1, 2, and 3, as well as a Name Node, which records the location of each data block. Say we now need to access the blue data block and the green data block; this is divided into the following steps (a client-side sketch follows the list):

  • The client sends a request to the Name Node to obtain the locations of the blue and green data blocks

  • The Name Node returns the addresses of Data Node 1 and Data Node 2

  • The client reads the blocks directly from Data Node 1 and Data Node 2
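From the client's point of view this routing is transparent: it simply opens a file by path. Here is a minimal read sketch using pyarrow's HDFS bindings; the Name Node address and file path are hypothetical, and a running HDFS cluster with libhdfs is assumed:

```python
from pyarrow import fs

# Connect to the cluster through the Name Node (hypothetical address).
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# The client only names the file; HDFS resolves which Data Nodes hold
# its blocks, and the data is read from those nodes directly.
with hdfs.open_input_stream("/data/example.txt") as stream:
    content = stream.read()

print(len(content))
```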

If we want to add data to the cluster, the steps are as follows (again, a sketch follows the list):

  • The client sends a write request to the Name Node

  • The Name Node acknowledges the request and returns the addresses of the target Data Nodes

  • The client writes the data to the returned addresses, and the corresponding machines confirm that the write succeeded

  • The client sends confirmation information to the Name Node
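A matching write sketch under the same assumptions; the Name Node chooses the destination Data Nodes behind the scenes:

```python
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# The Name Node picks the target Data Nodes; the client just streams
# the bytes, and the Data Nodes acknowledge the write.
with hdfs.open_output_stream("/data/new_file.txt") as stream:
    stream.write(b"hello hdfs")
```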

As you can see, the most critical node in the whole cluster is the Name Node, which manages the metadata of the entire file system and schedules the corresponding file operations. Of course, a cluster does not have to have only one Name Node: if there is only one, the whole cluster stops working the moment it becomes unavailable.

The storage and access operations described above are only the simplest cases; the actual situation is much more complex. For example, the cluster also needs to replicate data blocks for backup.

MapReduce

MapReduce is an abstract programming model that simplifies distributed data processing into two operations: Map and Reduce. Before MapReduce appeared, distributed data processing was very complicated: to have a cluster complete a task, you first had to break the task down into many subtasks, then assign the subtasks to different machines, and finally merge and summarize the subtask results.

MapReduce abstracts this process by dividing machines into two categories: Master and Worker. The Master is responsible for scheduling work, while the Workers are the machines that actually perform the tasks. Workers are in turn divided into two types: Mappers and Reducers. A Mapper executes a subtask, and a Reducer aggregates the Mappers' results.

A simple example illustrates this process. Suppose we need to count the number of aces in a deck of playing cards. We divide the deck into several piles, and each person (a Mapper) counts the aces in their pile: one counts 1 ace, another counts 2, and so on. When every pile has been counted, the partial results are summed up (Reduce), giving the number of aces in the whole deck.
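A minimal sketch of this idea in plain Python; the piles are toy data, and the Mappers are simulated sequentially rather than on separate machines:

```python
from functools import reduce

# A "deck" split into piles, one per Mapper (toy data).
piles = [
    ["A", "7", "K", "A"],
    ["3", "A", "9"],
    ["Q", "A", "2", "J"],
]

# Map: each Mapper counts the aces in its own pile.
partial_counts = [sum(1 for card in pile if card == "A") for pile in piles]

# Reduce: sum the partial counts into the final result.
total_aces = reduce(lambda a, b: a + b, partial_counts)
print(total_aces)  # 4
```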

Of course, a real job involves more than those two operations: there is also Split, which divides the input data, and Shuffle, which regroups and sorts the intermediate data. The design of these steps is particularly subtle, and a poor design can drag down the performance of the entire system. Beyond that, MapReduce has some inherent limitations:

  1. Reduce can only start after Map completes. If the data is not divided evenly, the whole job is greatly delayed
  2. Map and Reduce alone struggle to express complex processing logic
  3. Performance bottleneck: because intermediate results of MapReduce processing must be stored in HDFS, the disk read and write time greatly affects performance
  4. Each job has high latency, so MapReduce is only suitable for batch processing, not for real-time processing

Spark

Spark solves these problems to some extent and can be used as a replacement for MapReduce, running far faster than Hadoop's MapReduce.

To complete a multi-step job without repeatedly reading from and writing to disk, Spark proposes a new idea: the RDD, a distributed in-memory data abstraction.

RDD is short for Resilient Distributed Dataset. On top of RDDs, Spark defines many data operations, which greatly improves expressiveness compared with MapReduce.
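As a quick taste of those operations, here is a minimal PySpark sketch, assuming a local pyspark installation; the numbers are illustrative:

```python
from pyspark import SparkContext

# Start a local Spark context (assumes pyspark is installed locally).
sc = SparkContext("local[*]", "rdd-demo")

# Build an RDD and chain several operations on it; each step
# yields a new RDD rather than mutating the old one.
numbers = sc.parallelize(range(1, 101))
result = (numbers
          .filter(lambda x: x % 2 == 0)  # keep even numbers
          .map(lambda x: x * x)          # square them
          .reduce(lambda a, b: a + b))   # sum everything up

print(result)  # 171700
sc.stop()
```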

Admittedly, the RDD is a difficult concept to grasp. It is not a physical thing but a logical one: in actual physical storage, the real data still lives on different nodes. An RDD has the following features:

  • Partition
  • Immutable
  • Parallel operation

Partition

Partitioning means that the data in a single RDD is stored on different nodes in the cluster, and this property is what makes parallel processing possible. As mentioned earlier, the RDD is a logical concept; it is simply one way of organizing the data.
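A small PySpark sketch of partitioning; run locally the four partitions are just slices, but on a real cluster they would live on different nodes:

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "partition-demo")

# Spread ten numbers across four partitions.
rdd = sc.parallelize(range(10), numSlices=4)

print(rdd.getNumPartitions())  # 4
print(rdd.glom().collect())    # e.g. [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
sc.stop()
```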

Immutable

Each RDD is read-only, and its partition information cannot be changed. Since an existing RDD cannot be modified, every operation on the data produces a new RDD. Whenever a new RDD is created, Spark records which operation on which parent RDD produced it, so new and old RDDs form a dependency chain. One benefit is that the intermediate result of each step does not need to be stored: if a step fails, we only need to roll back to the previous RDD and rerun that step, without repeating all of the operations. The details of these dependencies are not elaborated here; the implementation logic is complicated, and a later article will be devoted to it.
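A sketch of this lineage in PySpark: each transformation returns a new RDD, the original is untouched, and the recorded dependency chain can be inspected (assuming a local pyspark installation):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")

base = sc.parallelize(range(5))
doubled = base.map(lambda x: x * 2)      # a new RDD; `base` is untouched
evens = doubled.filter(lambda x: x > 4)  # another new RDD

# Spark records how `evens` was derived rather than materializing
# each intermediate result; the lineage can be inspected:
print(evens.toDebugString().decode())
sc.stop()
```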

Parallel operation

As mentioned earlier, the data in a single RDD is stored on different nodes in the cluster, which ensures that it can be processed in parallel: data on different nodes can be processed independently.

For example, imagine a group of people, each holding several kinds of fruit, who must peel the fruit in order of species: apples first, then pears, then peaches. Peeling can only happen in parallel if each kind of fruit is spread across different hands; if one person holds all the apples and another holds all the pears, the second can only start peeling after the first has finished.
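In RDD terms, spreading each kind of fruit across hands is partitioning. Here is a small PySpark sketch where every partition does its own counting independently; the fruit data is illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "parallel-demo")

fruits = sc.parallelize(
    ["apple", "pear", "apple", "peach", "pear", "apple"], numSlices=4)

# Each partition is processed independently (and, on a cluster,
# on different machines); here every partition counts its own fruit.
def count_fruit(partition):
    counts = {}
    for fruit in partition:
        counts[fruit] = counts.get(fruit, 0) + 1
    yield counts

print(fruits.mapPartitions(count_fruit).collect())
sc.stop()
```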

Conclusion

Spark makes several improvements over MapReduce, resulting in a significant performance increase.

  • Spark keeps working data in memory instead of on disk, which greatly improves read and write speed (see the caching sketch below)
  • The result of each operation in a Spark job does not need to be written to disk; only the dependency relationships between operations are recorded. This improves fault tolerance and greatly reduces the cost of recovering a failed task
  • Data is partitioned so that it can be processed in parallel
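As a closing illustration of the in-memory point, a minimal caching sketch in PySpark; the log lines are toy data:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

logs = sc.parallelize(["ok", "error", "ok", "error", "error"])

# cache() asks Spark to keep this RDD in memory after it is first
# computed, so later actions reuse it instead of recomputing it.
errors = logs.filter(lambda line: line == "error").cache()

print(errors.count())  # first action computes and caches: 3
print(errors.count())  # second action reads from memory: 3
sc.stop()
```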