1. Comparison between MapReduce and Spark

1.1 What is MapReduce

MapReduce is a computing model that splits a large dataset into many independent tasks, executes them in parallel across a cluster, and then combines their results to produce the final result. For a detailed introduction to MapReduce, see the previous article (3), which explains how MapReduce works.

MapReduce solves many scenarios in big data, but its limitations are obvious:

  1. MapReduce provides only the Map and Reduce operations.
  2. Intermediate results are stored in the HDFS file system, which is inefficient for iterative computation.
  3. It is suited to batch processing and offers poor support for interactive data processing.
  4. Even simple jobs require a lot of low-level code.

Here is how MapReduce computes WordCount (the Map phase emits (word, 1) pairs, and the Reduce phase sums the counts for each word):
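As a rough sketch of the two phases (written in Scala for consistency with the rest of this article, not actual Hadoop API code):

// Illustrative sketch only; real Hadoop MapReduce code would use the Mapper/Reducer Java API.
// Map phase: split each line into words and emit a (word, 1) pair per word.
def mapPhase(line: String): Seq[(String, Int)] =
  line.split(" ").map(word => (word, 1)).toSeq

// After the shuffle groups pairs by word, the Reduce phase sums the counts for each word.
def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)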

1.2 What is Spark

Spark is a distributed computing framework based on the MapReduce model and improves on Hadoop's implementation. It keeps the advantages of MapReduce, but unlike MapReduce, a job's intermediate output and results can be kept in memory, which eliminates repeated reads and writes to HDFS. Spark is therefore better suited to algorithms that require iteration, such as data mining and machine learning.

However, Spark is only designed for computing, not data storage. Therefore, it needs to be connected to an external data source, such as HDFS.

Today, Spark is not just a replacement for MapReduce. It has evolved into a Spark ecosystem with many sub-projects. As shown in the figure, the Spark ecosystem can be divided into four layers:

  1. Data storage layer: distributed file systems such as HDFS, as well as various databases.
  2. Resource management layer: resource managers such as YARN.
  3. Data processing engine: Spark Core.
  4. Class library layer: rich library support, including Spark SQL, MLlib, GraphX and Spark Streaming, which can be combined seamlessly.

2. Features of Spark

Among big data storage, computation, and resource scheduling, Spark mainly solves the computation problem, replacing MapReduce, while many companies still use HDFS and YARN for the underlying storage and resource scheduling. What features make Spark the processing engine of choice for so many enterprises in the Hadoop ecosystem?

  1. Fast. Spark computes in memory.
  2. Easy to develop. Spark's RDD-based programming model is easier to understand and use than the MapReduce model, and it supports a wide variety of complex operations.
  3. Highly versatile. Spark provides Spark RDD, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX and other components, allowing offline batch processing, interactive queries, streaming computation, machine learning, and graph computation to be handled in one stack.
  4. Well integrated with Hadoop. Spark is a natural successor in the Hadoop ecosystem: HDFS, Hive and HBase for storage, YARN for resource scheduling, and Spark for computation together form a popular big data solution.
  5. Highly active community.

Spark interview questions

MapReduce and Spark are both parallel computing frameworks. What are the similarities and differences between them?

A: Both perform parallel computation based on the MapReduce model.

(1) In MapReduce, a unit of work is called a job. A job is divided into map tasks and reduce tasks; each task runs in its own process, and the process exits when the task finishes. In Spark, a user-submitted program is called an application, and each application corresponds to one SparkContext. Each action triggers a job, so an application can contain multiple jobs, which can run in parallel or serially. Each job has multiple stages; the DAGScheduler derives the stages by splitting the job at shuffle boundaries according to the dependencies between RDDs. Each stage contains multiple tasks, which form a TaskSet that the TaskScheduler distributes to the executors. An executor's lifecycle is the same as the application's, even when no job is running, so tasks can start quickly and compute against data held in memory.

(2) A MapReduce job has only Map and Reduce operations, so its expressive power is limited, and HDFS is read and written repeatedly during the MapReduce process, producing a large number of I/O operations. Spark performs iterative computation in memory, its API provides a large number of RDD operations such as join and groupBy, and the DAG model gives it good fault tolerance.

3. RDD

A Resilient Distributed Dataset (RDD) is the most basic data abstraction in Spark. It represents an immutable, partitioned collection of data whose elements can be computed in parallel.

Resilient: RDD data can be stored either in memory or on disk.

Distributed: the elements are stored in a distributed manner, which facilitates distributed computing.

Dataset: a collection of stored data.

In code, the result of each transformation is a new RDD, and each RDD depends on the RDDs it was derived from. As the RDDs execute, a DAG is formed and the lineage is recorded, which provides fault tolerance.

3.1 Five features of RDD

  1. A list of partitions: the RDD's data is made up of a list of partitions.
  2. A function for computing each split: each RDD has a compute function that is applied to each partition.
  3. A list of dependencies on other RDDs.
  4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned): a partitioning function for key-value RDDs.
  5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file): tasks are preferentially scheduled where each partition's data is stored.
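These five properties correspond roughly to members of Spark's RDD abstract class; the sketch below is a simplified paraphrase of the Spark source, shown only for illustration:

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Simplified paraphrase of org.apache.spark.rdd.RDD (not the full class).
abstract class SimplifiedRDD[T] {
  def getPartitions: Array[Partition]                                // 1. a list of partitions
  def compute(split: Partition, context: TaskContext): Iterator[T]   // 2. compute function for each partition
  def getDependencies: Seq[Dependency[_]]                            // 3. dependencies on other RDDs
  val partitioner: Option[Partitioner] = None                        // 4. optional partitioner for key-value RDDs
  def getPreferredLocations(split: Partition): Seq[String] = Nil     // 5. optional preferred locations
}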

3.2 RDD Creation

Build from an existing Scala collection:

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))
val rdd2 = sc.parallelize(Array("zookeeper", "kafka", "spark"))
val rdd3 = sc.makeRDD(List(1, 2, 3, 4))

Load external data sources to build:

val rdd1=sc.textFile("/words.txt")

Generate a new RDD by converting from an existing RDD:

val rdd2=rdd1.flatMap(_.split(" "))
val rdd3=rdd2.map((_,1))

3.3 Operator classification of RDD

3.3.1 Transformation

A transformation generates a new RDD from an existing one. Transformations are lazily evaluated and do not execute immediately, for example the map, flatMap and reduceByKey used in WordCount.

3.3.2 Action

An action actually triggers the job to run. It either returns the RDD's computed result data to the driver or saves the result data to external storage, for example the collect and saveAsTextFile used in WordCount.
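Putting the two together, a minimal WordCount sketch (the output path below is an illustrative assumption):

val lines  = sc.textFile("/words.txt")
val words  = lines.flatMap(_.split(" "))       // transformation: lazy
val pairs  = words.map((_, 1))                 // transformation: lazy
val counts = pairs.reduceByKey(_ + _)          // transformation: lazy, wide dependency
counts.collect().foreach(println)              // action: triggers a job, results return to the driver
counts.saveAsTextFile("/wordcount-output")     // action: triggers another job, writes results to storage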

3.4 Common operators of RDD

3.4.1 Transformation operators

map(func) Returns a new RDD consisting of each input element transformed by the func function
filter(func) Returns a new RDD consisting of input elements evaluated by the func function that return true
flatMap(func) Similar to a map, but each input element can be mapped to zero or multiple output elements (so func should return a sequence, not a single element)
mapPartitions(func) Similar to map, but runs separately on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U]
mapPartitionsWithIndex(func) Similar to mapPartitions, but func takes an integer argument representing the partition index, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U]
union(otherDataset) The union of the source RDD and the parameter RDD returns a new RDD
intersection(otherDataset) The intersection of the source RDD and the parameter RDD returns a new RDD
distinct([numTasks])) A new RDD is returned after the source RDD is de-duplicated
groupByKey([numTasks]) Called on an RDD of (K,V) pairs, returns an RDD of (K, Iterable[V]) pairs
reduceByKey(func, [numTasks]) Returns an RDD of (K,V). Aggregates the values of the same key using the specified Reduce function. Similar to groupByKey, the number of Reduce jobs can be set using the second optional parameter
sortByKey([ascending], [numTasks]) Called on an RDD of (K,V) pairs where K implements the Ordered interface, returns an RDD of (K,V) pairs sorted by key
sortBy(func,[ascending], [numTasks]) Similar to sortByKey, but more flexible
join(otherDataset, [numTasks]) Called on RDDs of type (K,V) and (K,W), returns an RDD of (K,(V,W)) pairs with all pairs of elements for each key
cogroup(otherDataset, [numTasks]) Called on RDDs of type (K,V) and (K,W), returns an RDD of type (K,(Iterable[V],Iterable[W]))
coalesce(numPartitions) Reduces the number of partitions in the RDD to the specified value.
repartition(numPartitions) Repartition the RDD
repartitionAndSortWithinPartitions(partitioner) Repartition the RDD and sort it by record key within each partition
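A few of these transformations in action; the sample inputs are made up for illustration and the expected results are shown in the comments:

val nums  = sc.parallelize(List(1, 2, 3, 4, 5))
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val other = sc.parallelize(List(("a", "x")))

nums.map(_ * 2)              // 2, 4, 6, 8, 10
nums.filter(_ % 2 == 0)      // 2, 4
pairs.reduceByKey(_ + _)     // ("a", 4), ("b", 2)
pairs.groupByKey()           // ("a", [1, 3]), ("b", [2])
pairs.join(other)            // ("a", (1, "x")), ("a", (3, "x"))
pairs.sortByKey()            // keys in order: "a", "a", "b"
nums.coalesce(1)             // same data in a single partition, no shuffle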

3.4.2 Action operators

reduce(func) Passes the first two elements of the RDD to the input function; the return value and the next element are then passed to the function again, and so on, until only one value remains
collect() Returns all elements of the dataset to the driver as an array
count() Returns the number of elements of RDD
first() Returns the first element of the RDD (similar to take(1))
take(n) Returns an array consisting of the first n elements of the dataset
takeOrdered(n, [ordering]) Returns the first n elements of natural or custom order
saveAsTextFile(path) Saves the elements of the dataset as text files to HDFS or another supported file system; for each element, Spark calls the toString method to convert it to a line of text in the file
saveAsSequenceFile(path) Saves the elements of the dataset in Hadoop SequenceFile format to a specified directory on HDFS or another file system supported by Hadoop
saveAsObjectFile(path) Saves the elements of the dataset to a specified directory using Java serialization
countByKey() For RDD of type (K,V), return a map of (K,Int), representing the number of elements corresponding to each key.
foreach(func) On each element of the dataset, run the function func
foreachPartition(func) On each partition of the dataset, run the function func
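And a few actions; again, the inputs are made up and the expected results appear in the comments:

val nums = sc.parallelize(List(3, 1, 4, 1, 5))
val kv   = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)))

nums.reduce(_ + _)       // 14
nums.count()             // 5
nums.first()             // 3
nums.take(3)             // Array(3, 1, 4)
nums.takeOrdered(3)      // Array(1, 1, 3)
kv.countByKey()          // Map(a -> 2, b -> 1)
nums.foreach(println)    // runs on the executors, not the driver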

Interview questions

  1. Which of the following is not a feature of RDD (C)

    A. partition B. serializable C. modifiable D. persistent

  2. What is the elasticity of RDD?

  • Automatic switching between memory and disk storage;
  • Efficient fault tolerance based on lineage;
  • A failed task is automatically retried a specified number of times;
  • A failed stage is automatically retried a certain number of times, and only the failed partitions are recomputed;
  • Checkpoint and persist: data can be persisted or cached after it is computed;
  • Flexible data scheduling: DAG task scheduling is independent of resource management;
  • Highly elastic data partitioning: many small partitions can be merged into larger ones.
  3. What are the drawbacks of RDD?
  • Fine-grained writes and updates (such as those needed by a web crawler) are not supported. Spark writes data at a coarse granularity, i.e. in batches, for efficiency; reads, however, can be fine-grained, meaning records can be read one by one.
  • Incremental iterative computation is not supported (Flink supports it).
  4. What are the ways to create an RDD?
  • Create an RDD using collections in your program
  • Create an RDD using a local file system
  • Create an RDD using HDFS
  • Create an RDD based on the database
  • Create an RDD based on NoSQL databases, such as HBase
  • Create an RDD based on S3
  • Create an RDD based on a data stream, such as a socket

4. Spark architecture

In a Spark cluster, the driver creates the SparkContext. The SparkContext coordinates with the resource manager, which launches executors on the worker nodes according to the resources the user requests; each worker node runs one or more executors. An executor is a process that runs individual tasks, and each task executes the same piece of code against a different partition of the data.
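As a minimal sketch, the executor resources can be requested when the SparkContext is created; the application name and values below are illustrative assumptions, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

// The driver builds the SparkContext; the cluster manager (e.g. YARN) then launches
// executors on the worker nodes according to these settings.
val conf = new SparkConf()
  .setAppName("architecture-demo")            // illustrative name
  .setMaster("yarn")                          // or "local[*]" on a single machine
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)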

5. Lineage

The lineage of an RDD records its metadata and the transformations that produced it. Because lineage preserves the RDD's dependencies, when the data of some RDD partitions is lost, Spark can recompute and recover the lost partitions from this information. Note that no human intervention is needed to recover partition data; the program itself recovers it based on the RDD's lineage.

6. DAG

A DAG (Directed Acyclic Graph) is a directed acyclic graph. The original RDDs are transformed into a DAG through a series of transformations.

6.1 Narrow and wide dependencies

A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD, for example map, flatMap, filter, and union. Narrow dependencies do not produce a shuffle.

A wide dependency means that multiple child RDD partitions depend on the same parent RDD partition, for example reduceByKey, sortByKey, groupBy, groupByKey, and join. Wide dependencies produce a shuffle.
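toDebugString makes the boundary visible: in the sketch below the narrow transformations stay in one stage, while reduceByKey introduces a shuffle (this assumes the /words.txt input used earlier):

val counts = sc.textFile("/words.txt")
  .flatMap(_.split(" "))   // narrow dependency, no shuffle
  .map((_, 1))             // narrow dependency, no shuffle
  .reduceByKey(_ + _)      // wide dependency, produces a shuffle (stage boundary)

// Prints the lineage; the indentation change marks the shuffle between stages.
println(counts.toDebugString)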

6.2 Types of stages

ShuffleMapStage: the stages covering the transformations before the last shuffle are ShuffleMapStages, and their corresponding tasks are ShuffleMapTasks.

ResultStage: the operations after the last shuffle form the ResultStage, which is the final stage; its corresponding tasks are ResultTasks.

6.3 Why stages are divided

The DAG can be divided into different stages according to the dependencies between RDDs. For narrow dependencies, the partition transformations are completed within a single stage. For wide dependencies, because a shuffle occurs, the next computation can only start after the parent RDD has been fully processed.

After the stages are divided, pipelined computation becomes possible, because a stage contains only narrow dependencies and no wide dependencies. Each partition in a stage corresponds to a task, and the many tasks within a stage can run in parallel.

6.4 How stages are divided

Stages are divided according to wide dependencies:

  1. First, the DAG is generated according to the order of the operator operations on the RDDs. Starting from the last RDD and working backwards, a new stage is created and that RDD is added to it; this will be the final stage.
  2. Continuing backwards, whenever a narrow dependency is encountered, that RDD is added to the current stage; whenever a wide dependency is encountered, the graph is cut at the wide dependency, which closes off the current stage.
  3. A new stage is then created and step 2 is repeated, all the way back to the original RDD, at which point the division into stages is complete.

6.5 The relationship between stages

After the stages are divided, each stage contains many tasks that can run in parallel. The tasks of each stage are encapsulated in a TaskSet, and each TaskSet is then submitted to the executors on the worker nodes to run.

7. Caching mechanism of RDD

Caching RDD data means that if the RDD's result data will be used by other jobs later, it can be fetched directly from the cache, avoiding recomputation and speeding up subsequent access to the data.

RDD provides the persist and cache methods to cache the results of earlier computations. Note that the RDD is not cached immediately when these methods are called; only when a subsequent action triggers the computation is the RDD kept in memory for later reuse.

The difference between persist and cache: cache places data in memory by default and is essentially a call to persist. Persist, on the other hand, can store data in memory or on disk and offers a rich set of storage levels.
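A minimal sketch of the difference:

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("/words.txt").flatMap(_.split(" "))

words.cache()                                    // same as persist(StorageLevel.MEMORY_ONLY)
// words.persist(StorageLevel.MEMORY_AND_DISK)   // persist lets you pick a storage level

words.count()   // first action: computes the RDD and fills the cache
words.count()   // second action: reads from the cache, no recomputation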

7.1 Timing of cache use

Suppose RDD3 is needed: the computation starts from RDD1, which yields RDD2 after an operator operation, and RDD2 then yields RDD3. Likewise, computing RDD4 starts again from RDD1.

By default, the parent RDDs that precede an RDD are recomputed every time it is used. This is common in real development, but repeatedly recomputing the same RDD should be avoided, as it leads to a sharp drop in performance.

When obtaining an RDD's result data requires many operator operations or heavy computation, we can cache (persist) the RDDs that are used multiple times to improve efficiency.
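For example, if rdd2 is reused by both rdd3 and rdd4, caching it once avoids recomputing it from rdd1 (a sketch based on the WordCount data used earlier):

val rdd1 = sc.textFile("/words.txt")
val rdd2 = rdd1.flatMap(_.split(" ")).map((_, 1))   // reused by rdd3 and rdd4
rdd2.cache()                                        // persist it once

val rdd3 = rdd2.reduceByKey(_ + _)   // reuses rdd2
val rdd4 = rdd2.distinct()           // also reuses rdd2; once cached, rdd2 is not recomputed from rdd1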

7.2 Clearing Cached Data

Automatic cleanup: after the application ends, the corresponding cached data is automatically cleared

Manual cleanup: Call the unpersist method

Although RDD data can be cached and stored in memory or disk, it is not particularly safe.

The cache stores data directly in memory, which can result in data loss if a server hangs or a process terminates.

Persist stores data on a local disk and may cause data loss if the disk is damaged.

8. Checkpoint mechanism of RDD

Checkpoint provides a more reliable way to persist data. Data is stored in a distributed file system such as HDFS to maximize data security with high fault tolerance (multiple copies).

8.1 Setup

1. Set a checkpoint directory on HDFS:

sc.setCheckpointDir("hdfs://node1:9000/checkpoint") 

2. Call the checkpoint method on the RDD that needs to be checkpointed:

val rdd1=sc.textFile("/words.txt")
rdd1.checkpoint
val rdd2=rdd1.flatMap(_.split(" ")) 

3. An action is required to trigger the task to run; each action corresponds to a job.
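For example, continuing the snippet above, any action triggers the job, and Spark then launches a separate job to write the checkpoint data to HDFS:

rdd2.collect()   // triggers the job; a separate job then writes rdd1's checkpoint data to HDFS
// Tip (optional): caching rdd1 before checkpointing avoids computing it twice.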

8.2 Difference between cache, persist, and checkpoint

cache and persist

cache caches data in memory by default, while persist can store data in memory or on disk. Both require an action to trigger the actual caching, and neither starts a new job.

Neither changes the RDD's dependencies, and the corresponding cached data disappears automatically after the program finishes.

checkpoint

Checkpointing persists data to HDFS. An action is required to trigger the checkpoint operation, and a new job is started to perform it.

Checkpointing changes the RDD's dependencies: if data is lost later, it cannot be recovered through lineage, because Spark assumes the data has already been safely persisted to HDFS and removes the dependency information. Checkpoint data does not disappear after the program finishes.

9. Why is Spark good at iterative computation

One iteration of the PageRank algorithm in MapReduce is shown below; note that the gray parts are data that must be written to disk:

One iteration of the PageRank algorithm in Spark, which makes several improvements over MapReduce:

First, when memory is sufficient, Spark allows users to cache frequently used data in memory, which speeds up execution.

Second, Spark clearly distinguishes data dependencies and schedules tasks based on wide and narrow dependencies, which enables pipelined operation and makes the system more flexible.

The second iteration of the PageRank algorithm in MapReduce:

The second iteration of the PageRank algorithm in Spark:

As shown in the figure, Spark can assign RDD partitions connected by narrow dependencies to a single task for pipelined execution; data within a task does not need to travel over the network, and tasks do not interfere with one another. As a result, two Spark iterations involve only three shuffles.

Within a single iteration, MapReduce and Spark may not differ much in performance, but as the number of iterations increases the gap becomes more and more obvious. Spark's dependency-based task scheduling significantly reduces the number of shuffles compared with MapReduce, which is why Spark is well suited to iterative computation.
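A minimal PageRank sketch (adapted from the standard Spark example) illustrates the point: the links RDD is reused in every iteration, so it is cached once. The input path, input format, and iteration count below are illustrative assumptions.

// Each input line is assumed to look like "pageA pageB", meaning pageA links to pageB.
val links = sc.textFile("/links.txt")
  .map { line => val parts = line.split(" "); (parts(0), parts(1)) }
  .groupByKey()
  .cache()                                  // reused in every iteration

var ranks = links.mapValues(_ => 1.0)

for (_ <- 1 to 10) {                        // 10 iterations, chosen arbitrarily
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.collect().foreach(println)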