Characteristics of Spark

Compared with Hadoop's MapReduce processing model, Spark has the following features:

  1. Reduced disk I/O

    • The Map side of Hadoop MapReduce writes its intermediate output and results to disk, and the Reduce side then reads those intermediate results back from disk, so disk I/O becomes a bottleneck for Hadoop. Spark allows the Map side to keep intermediate output and results in memory, and the Reduce side avoids a large amount of disk I/O when pulling those intermediate results.
    • In Hadoop YARN, after the ApplicationMaster applies for a Container, the NodeManager downloads the resources required by the task (such as JAR packages) from different HDFS nodes, which also incurs disk I/O. Spark instead buffers the resource files uploaded by the application in the memory of the Driver's local file service, and an Executor reads the resource files directly from the Driver's memory when executing a task.
  2. Avoiding recomputation

    If a Task for one partition of a Stage fails, the Stage is rescheduled; during rescheduling, the tasks for partitions that have already completed successfully are filtered out, so their work is not recomputed.

  3. Optional sorting during Shuffle

    Hadoop MapReduce always performs a fixed sort before Shuffle, whereas Spark can choose whether to sort on the Map side or the Reduce side depending on the scenario.

  4. Flexible memory management strategy

    Spark divides memory into four parts: on-heap storage memory, off-heap storage memory, on-heap execution memory, and off-heap execution memory. Spark supports both a fixed boundary and a dynamic boundary between execution memory and storage memory, and uses the dynamic boundary by default: when either execution memory or storage memory is insufficient, it can borrow memory from the other, maximizing resource utilization.

  5. Checkpoint support

    Spark RDDs maintain lineage between one another: if an RDD fails, it can be rebuilt from its parent RDDs. For a long lineage, however, reconstruction is very time-consuming. With checkpointing, SparkContext saves the results of RDD computations to the checkpoint after all tasks of the relevant stages complete successfully, so that if an RDD computation later fails, the data can be recovered directly from the checkpoint.
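
As a concrete illustration of checkpoint support, here is a minimal Scala sketch; the checkpoint directory, application name, and sample computation are placeholders for illustration only.

    import org.apache.spark.{SparkConf, SparkContext}

    object CheckpointExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("checkpoint-example").setMaster("local[*]")
        val sc = new SparkContext(conf)
        // In production this is usually a reliable path such as an HDFS directory.
        sc.setCheckpointDir("/tmp/spark-checkpoints")

        val rdd = sc.parallelize(1 to 1000000).map(_ * 2).filter(_ % 3 == 0)
        rdd.checkpoint() // mark the RDD; its data is written once the first job on it completes
        rdd.count()      // action triggers computation and materializes the checkpoint

        // Subsequent failures can recover from the checkpoint instead of replaying the lineage.
        println(rdd.isCheckpointed)
        sc.stop()
      }
    }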

Basic concepts of Spark

RDD

Resilient Distributed Dataset (RDD) represents an immutable, partitioned collection of elements that can be operated on in parallel. An RDD is a fault-tolerant, parallel data structure that lets users control whether data is stored on disk or in memory and how it is partitioned for retrieval. An RDD contains one or more partitions, each of which is a fragment of the data set.

An RDD includes the following elements:

  • A compute function that operates on each partition of the RDD
  • The list of partitions
  • The RDD's dependencies on its parent RDDs (an empty list for a top-level RDD)
  • A Partitioner that determines how records are assigned to partitions (for key-value RDDs)
  • A list of preferred locations for each partition (the specific location of each partition)
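
These elements correspond to the abstract members of Spark's RDD class. The Scala sketch below is an illustrative rendering of that interface; the member names follow Spark's API, but the class itself is simplified and hypothetical.

    import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

    // Illustrative sketch of the five core properties an RDD exposes (not the real source).
    abstract class RddSketch[T] {
      // 1. compute function applied to a single partition
      def compute(split: Partition, context: TaskContext): Iterator[T]
      // 2. the list of partitions
      protected def getPartitions: Array[Partition]
      // 3. dependencies on parent RDDs (empty for a top-level RDD)
      protected def getDependencies: Seq[Dependency[_]]
      // 4. optional Partitioner (only meaningful for key-value RDDs)
      val partitioner: Option[Partitioner] = None
      // 5. preferred locations of each partition (data locality)
      protected def getPreferredLocations(split: Partition): Seq[String] = Nil
    }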

Calculation function

  • Transformation

    Both the input and the output of a transformation are RDDs, and transformations execute lazily, that is, they are only executed when an Action operation is encountered

  • Action

    An Action returns a result to the Driver or writes it to the storage system; its return value is not an RDD. An Action triggers the execution of the preceding transformations
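
A minimal Scala sketch of the difference between transformations and actions; the input path and application name are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object LazyEvaluationExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lazy-eval").setMaster("local[*]"))

        // Transformations: RDD in, RDD out, nothing is computed yet.
        val counts = sc.textFile("input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Action: triggers execution of the whole transformation chain and
        // returns the result to the Driver; the return value is not an RDD.
        val result: Array[(String, Int)] = counts.collect()
        result.take(5).foreach(println)
        sc.stop()
      }
    }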

Dependency

Each RDD has dependencies, which are classified as narrow dependencies (NarrowDependency) and shuffle dependencies (ShuffleDependency, also known as wide dependencies).

  • Narrow dependencies: the narrow dependencies are OneToOneDependency and RangeDependency. The relationship between parent and child RDD partitions is one-to-one, that is, the data in a partition of the parent RDD feeds only one partition of the child RDD. No Shuffle occurs under a narrow dependency. Narrow dependencies allow different operators to be executed as a pipeline on a single Executor. When the data of a child partition is lost or computed incorrectly, only the corresponding parent partition needs to be recomputed.

  • Wide dependencies: the relationship between parent and child RDD partitions is one-to-many, that is, each partition of the parent RDD may be used by multiple partitions of the child RDD. A Shuffle occurs under a wide dependency. An RDD with a wide dependency cannot be computed until all of its parent RDD partitions are complete, and if the data of a child partition is lost or computed incorrectly, all of the parent partitions it depends on must be recomputed.
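
The dependency type that Spark records can be inspected directly on an RDD. A small Scala sketch, with placeholder data:

    import org.apache.spark.{SparkConf, SparkContext}

    object DependencyExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[*]"))

        val nums   = sc.parallelize(1 to 100, numSlices = 4)
        val mapped = nums.map(n => (n % 10, n))  // narrow: each child partition reads one parent partition
        val summed = mapped.reduceByKey(_ + _)   // wide: a shuffle is required across partitions

        // Print the dependency classes Spark recorded for each RDD.
        println(mapped.dependencies.map(_.getClass.getSimpleName).mkString(", ")) // OneToOneDependency
        println(summed.dependencies.map(_.getClass.getSimpleName).mkString(", ")) // ShuffleDependency
        sc.stop()
      }
    }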

Job, Stage, and Task

The relationship between Job, Stage, and Task is as follows:

  • Job: a job submitted by the user; Spark starts one Job for each Action.
  • Stage: an execution stage of a Job. Spark divides a Job into stages at each ShuffleDependency: the ShuffleMapStages sit upstream of a wide dependency and the ResultStage sits downstream.
  • Task: a concrete unit of execution. Within each Stage, a Job creates one task per RDD partition. Tasks in a ShuffleMapStage are called ShuffleMapTasks, and tasks in a ResultStage are called ResultTasks.
  • The relationship among the three: a Job is divided into multiple Stages, and each Stage contains multiple Tasks.
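
A short Scala sketch of how one action maps onto a Job, Stages, and Tasks, assuming sc is an already created SparkContext (for example the one provided by spark-shell):

    // One action -> one Job; the ShuffleDependency introduced by reduceByKey splits it into two stages.
    val pairs  = sc.parallelize(1 to 100, numSlices = 4).map(n => (n % 10, n))
    val summed = pairs.reduceByKey(_ + _)

    // count() starts a Job:
    //   Stage 0: ShuffleMapStage with 4 ShuffleMapTasks (one per partition of `pairs`)
    //   Stage 1: ResultStage whose ResultTasks read the shuffled data and return counts to the Driver
    summed.count()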

DAG

A Directed Acyclic Graph (DAG) is a graph of vertices and edges that contains no cycles. Spark uses a DAG to represent the dependencies and lineage relationships between RDDs.

When an Action is triggered, Spark submits the job to the DAGScheduler, which divides the job into stages at the wide dependencies. The stages are submitted to the TaskScheduler, which in turn dispatches the tasks to the nodes of the cluster.
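
The lineage that the DAGScheduler works from can be printed with RDD.toDebugString. A brief Scala sketch, again assuming an existing SparkContext sc:

    // toDebugString prints the RDD lineage; indentation marks the shuffle (stage) boundaries.
    val rdd = sc.parallelize(1 to 10)
      .map(n => (n % 3, n))
      .reduceByKey(_ + _)

    println(rdd.toDebugString)
    // The output shows a ShuffledRDD built on a MapPartitionsRDD and a ParallelCollectionRDD,
    // i.e. the stages the DAGScheduler will create once an action such as collect() runs.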

Spark model design

Programming model

The complete process of a Spark application, from writing the program through submission, execution, and output, is as follows:

  1. Write the Driver application using the API provided by SparkContext
  2. Submit the user application with SparkContext, which then:
    • Registers the Application and applies for resources: the Application is registered with the Cluster Manager through RpcEnv, and the Cluster Manager is informed of the amount of resources required
    • Resource allocation: the Cluster Manager assigns Executor resources to the Application based on its requirements, and the TaskScheduler records the addresses, sizes, and other information of the Executor resources allocated to the Application
    • Broadcast of configuration information: SparkContext uses BlockManager and BroadcastManager to broadcast the task configuration before the RDD transformations begin
    • DAG construction: SparkContext builds the relationships between RDDs through the various transformation APIs, and the resulting DAG of RDDs is submitted to the DAGScheduler
    • Task division: the DAGScheduler creates jobs from the submitted DAG and divides them into stages; tasks are created according to the number of RDD partitions in each stage and are submitted to the TaskScheduler in batches
    • Task scheduling: the TaskScheduler schedules the tasks using FIFO or FAIR, allocates Executor resources to the tasks, and sends the tasks to the Executors for execution
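
A minimal Driver application written against the SparkContext API, illustrating step 1 above; the object name is arbitrary, and the input and output paths are taken from the command line:

    import org.apache.spark.{SparkConf, SparkContext}

    // Step 1: write the Driver application with the SparkContext API.
    // Step 2 (submission, resource allocation, scheduling) is handled by Spark once the
    // application is launched, typically via spark-submit.
    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("word-count"))

        sc.textFile(args(0))            // transformations build the RDD relationships and the DAG
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile(args(1))      // action: DAGScheduler and TaskScheduler take over from here

        sc.stop()
      }
    }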

RDD calculation model

An RDD can be regarded as a unified abstraction over various data computation models, and Spark's computation process is essentially an iterative computation over RDDs.

The number of partitions depends on how many partitions have been configured. The data of each partition is computed by exactly one Task, and all partitions can be processed in parallel on the Executors of multiple machine nodes.
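
A brief Scala sketch of controlling the partition count, assuming an existing SparkContext sc; the partition numbers are arbitrary:

    // The partition count fixes how many tasks a stage creates for this RDD.
    val rdd = sc.parallelize(1 to 1000, numSlices = 8)
    println(rdd.getNumPartitions)            // 8: one task per partition

    val repartitioned = rdd.repartition(16)  // shuffle into 16 partitions -> 16 parallel tasks
    println(repartitioned.getNumPartitions)  // 16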

Basic Architecture of Spark

A Spark cluster consists of the Cluster Manager, Workers, Executors, the Driver, and the Application.

  • Cluster Manager

    The Cluster Manager allocates and manages cluster resources; in YARN mode it is the ResourceManager. The resources allocated by the Cluster Manager form the first-level allocation, in which memory, CPU, and other resources on each Worker are allocated to Applications.

  • Worker

    A working node; in YARN mode it is the NodeManager. A Worker is mainly responsible for reporting its own memory and CPU resources to the Cluster Manager through a registration mechanism, creating Executors, further allocating resources and tasks to its Executors, and synchronizing resource and Executor status information to the Cluster Manager.

  • Executor

    The front-line component that performs computing tasks; it is mainly responsible for executing tasks and synchronizing information with the Worker and the Driver.

  • Driver

    The driver of the Application. The Application communicates with the Cluster Manager and the Executors through the Driver.

  • Application

    A user program written with the APIs provided by Spark. The Application transforms RDDs and builds the DAG through the Spark API, and registers itself with the Cluster Manager via the Driver. The Cluster Manager allocates Executors, memory, CPU, and other resources to the Application according to its resource requirements. The Driver then performs the second-level allocation, assigning resources such as Executors to individual tasks, and the Application instructs the Executors to run the tasks through the Driver.
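
The resource requirements that drive the first-level allocation can be expressed through configuration. The keys below are standard Spark settings, but the values and the object name are illustrative only; a sketch assuming the application is submitted to YARN:

    import org.apache.spark.{SparkConf, SparkContext}

    object ResourceDemo {
      def main(args: Array[String]): Unit = {
        // Illustrative resource requirements for the Application (first-level allocation by the
        // Cluster Manager); the values are examples, not recommendations.
        val conf = new SparkConf()
          .setAppName("resource-demo")
          .set("spark.executor.memory", "4g")     // memory per Executor
          .set("spark.executor.cores", "2")       // CPU cores per Executor
          .set("spark.executor.instances", "10")  // number of Executors (YARN mode)

        // Creating the SparkContext registers the Application with the Cluster Manager via the
        // Driver; the master (e.g. yarn) is normally supplied at submission time by spark-submit.
        val sc = new SparkContext(conf)
        // ... build RDDs and trigger actions; the Driver performs the second-level allocation,
        // assigning the tasks to the granted Executors ...
        sc.stop()
      }
    }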

Module design of Spark

Spark consists of Spark Core, Spark SQL, Spark Streaming, GraphX, and MLlib, where the capabilities of Spark SQL, Spark Streaming, GraphX, and MLlib are built on top of Spark Core.

The basic and core functions of Spark are as follows

  1. Infrastructure

    Spark’s infrastructure includes

    • The Spark configuration (SparkConf) is used to manage the configuration information of the Spark application
    • The built-in RPC framework of Spark enables the communication between Spark components through RPC
    • While RPC is the communication facility between components across machine nodes, the event bus provides asynchronous, listener-pattern communication between components inside SparkContext
    • The metrics system, consisting of Spark's metrics sources and sinks, monitors the running status of every component in the Spark cluster.
  2. SparkContext

    SparkContext is used to write applications and submit jobs. SparkContext is the interface through which users access the various functions of Spark; only its API needs to be invoked to develop those functions (see the sketch after this list)

  3. SparkEnv

    The Spark execution environment, SparkEnv, contains the various components needed for Task execution, such as the RPC environment, serialization manager, broadcast manager, map task output tracker, storage system, and metrics system

  4. Storage system

    Spark preferentially uses the memory of each node for storage and falls back to disk only when memory is insufficient. Spark's storage memory and execution memory have no hard boundary and share the same memory region, so the memory required by either side can be balanced dynamically, improving both resource utilization and task execution efficiency.

  5. Scheduling system

    The scheduling system is mainly composed of DAGScheduler and TaskScheduler, which are built into SparkContext.

    • The DAGScheduler is responsible for creating Jobs, dividing the RDDs in the DAG into different stages, creating the corresponding tasks for each stage, and submitting the tasks in batches.
    • The TaskScheduler is responsible for scheduling the batches of tasks according to the FIFO or FAIR scheduling algorithm, assigning resources to the tasks, and sending the tasks to the Executors that the Cluster Manager has allocated to the current application for execution.
  6. Calculation engine

    The computing engine consists of the MemoryManager, Tungsten, TaskMemoryManager, Task, ExternalSorter, ShuffleManager, and so on.

    • MemoryManager provides support and management for storage memory in the storage system as well as execution memory in the computing engine
    • Tungsten can also be used for computation or execution in addition to storage
    • TaskMemoryManager provides more fine-grained management and control of memory resources allocated to individual tasks
    • The ExternalSorter is used on the Map or Reduce side to sort and aggregate the intermediate results computed by ShuffleMapTasks
    • The ShuffleManager persists the intermediate results computed by ShuffleMapTasks for each partition to disk, and on the Reduce side it remotely fetches those intermediate results by partition
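
A small Scala sketch tying several of these modules together: SparkConf carries the configuration, and SparkContext wires up SparkEnv and the scheduling system. The memory and scheduler keys are standard Spark settings; the values and the object name are illustrative only.

    import org.apache.spark.{SparkConf, SparkContext}

    object ModuleDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("module-demo")
          .setMaster("local[*]")
          .set("spark.memory.fraction", "0.6")        // heap share used for unified storage + execution memory
          .set("spark.memory.storageFraction", "0.5") // soft boundary: storage's share of that unified region
          .set("spark.scheduler.mode", "FAIR")        // TaskScheduler policy: FIFO (default) or FAIR

        val sc = new SparkContext(conf)               // builds SparkEnv, DAGScheduler, and TaskScheduler

        // Transformations build the DAG; the action hands it to the DAGScheduler, which submits
        // stages of tasks to the TaskScheduler for execution on Executors.
        sc.parallelize(1 to 100).map(_ * 2).count()
        sc.stop()
      }
    }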