1. Spark's Advantages and Features

As the successor to MapReduce among big data computing frameworks, Spark has the following advantages.

1. Efficiency

Unlike MapReduce, which stores intermediate results on disk, Spark keeps intermediate results in memory, reducing disk I/O for iterative operations. It also optimizes the DAG of a parallel computation, reducing dependencies between tasks and lowering latency. For in-memory computation, Spark can be up to 100 times faster than MapReduce.
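
As a rough illustration of keeping intermediate results in memory, the sketch below caches an RDD before running an iterative workload over it. This is a minimal sketch, not Spark's internal mechanism: `sc` is assumed to be an existing SparkContext, the HDFS path is a placeholder, and the file is assumed to hold one number per line.

# Minimal sketch: cache an RDD in memory so repeated (iterative) actions
# do not re-read the source or recompute the lineage each time.
# `sc` is an existing SparkContext; the HDFS path is a placeholder.
points = sc.textFile("hdfs:///tmp/points.txt").map(lambda line: float(line))
points.cache()                    # keep the parsed data in memory after the first pass

for i in range(10):               # iterative workload
    total = points.sum()          # later iterations read from memory, not from disk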

2. Ease of use

Unlike MapReduce, which supports only Map and Reduce, Spark provides more than 80 Transformation and Action operators, such as map, reduce, filter, groupByKey, sortByKey, and foreach. It also uses a functional programming style, which greatly reduces the amount of code needed to implement the same functionality.
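
A small sketch of this operator style follows; the sample data is illustrative and `sc` is assumed to be an existing SparkContext.

# A few of the operators mentioned above, chained in the functional style.
# `sc` is an existing SparkContext; the sample data is illustrative.
words = sc.parallelize(["apple", "banana", "apple", "cherry", "banana", "apple"])

pairs = (words.map(lambda w: (w, 1))                 # Transformation: map
              .filter(lambda kv: kv[0] != "cherry")  # Transformation: filter
              .reduceByKey(lambda a, b: a + b))      # Transformation: reduceByKey

print(pairs.sortByKey().collect())                   # Action: collect returns results to the Driver
# [('apple', 3), ('banana', 2)]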

3. Versatility

Spark provides a unified solution: it can be used for batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (MLlib), and graph computing (GraphX). These different types of processing can all be used seamlessly within the same application. This lets enterprise applications use a single platform for different engineering needs, reducing development manpower and platform deployment costs.
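
As one illustration of mixing these components in a single application, the sketch below combines a batch DataFrame with an interactive SQL query; the app name and sample data are placeholders.

# Sketch: batch processing and Spark SQL in the same application.
# The app name and sample data are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])  # batch DataFrame
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()                 # interactive SQL query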

4. Compatibility

Spark is compatible with many open source projects. For example, Spark can use Hadoop YARN or Apache Mesos as its resource manager and scheduler, and it can read from multiple data sources, such as HDFS, HBase, and MySQL.
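
The sketch below reads from two of these sources. It is only an illustration: `sc` and `spark` are assumed to be an existing SparkContext and SparkSession, the HDFS path and MySQL connection details are placeholders, and the JDBC read assumes a MySQL driver is available on the classpath.

# Sketch: reading from two of the data sources mentioned above.
# `sc` and `spark` are an existing SparkContext / SparkSession.
# The HDFS path and MySQL connection details are placeholders.
logs_rdd = sc.textFile("hdfs://namenode:9000/data/logs.txt")    # HDFS

users_df = (spark.read.format("jdbc")                           # MySQL via JDBC
            .option("url", "jdbc:mysql://db-host:3306/testdb")
            .option("dbtable", "users")
            .option("user", "reader")
            .option("password", "secret")
            .load())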

2. Basic Concepts of Spark

  • RDD (Resilient Distributed Dataset): an abstraction of distributed memory that provides a highly restricted shared-memory model.

  • DAG: Directed Acyclic Graph, which reflects the dependencies among RDDs.

  • Driver Program: the control program that builds the DAG for the Application.

  • Cluster Manager: the cluster resource management center that allocates computing resources.

  • Worker Node: a worker node that performs the actual computation.

  • Executor: a process running on a Worker Node that runs Tasks and stores data for the Application.

  • Application: a Spark application written by the user. An Application contains multiple Jobs.

  • Job: a Job contains multiple RDDs and the operations that act on them.

  • Stage: a Job is divided into groups of Tasks; each group of Tasks is called a Stage.

  • Task: a unit of work that runs on an Executor; each Task is a thread in an Executor.

Summary: an Application is composed of multiple Jobs, a Job is composed of multiple Stages, and a Stage is composed of multiple Tasks. The Stage is the basic unit of job scheduling.
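
To make these terms concrete, the illustrative sketch below is one Application that triggers two Jobs (one per action), with the reduceByKey shuffle splitting the work into Stages; `sc` is an existing SparkContext and the data is made up.

# Sketch: one Application that triggers two Jobs (one per action).
# The reduceByKey shuffle splits the work into Stages, and each Stage
# runs as a set of Tasks on the Executors. The data is illustrative.
nums = sc.parallelize(range(100), 4)
counts = nums.map(lambda x: (x % 3, 1)).reduceByKey(lambda a, b: a + b)

counts.count()      # Action 1 -> Job 1
counts.collect()    # Action 2 -> Job 2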

3. Spark architecture design

The Spark cluster consists of the Driver, the Cluster Manager (Standalone, YARN, or Mesos), and Worker Nodes. For each Spark application, there is an Executor process on each Worker Node, and each Executor process contains multiple Task threads. For PySpark, Spark wraps a layer of Python APIs around this so as not to break Spark's existing runtime architecture. On the Driver side, Py4j is used to implement the interaction between Python and Java, so that Spark applications can be written in Python. On the Executor side, Py4j is not needed, because the Task logic that the Executor runs is sent from the Driver as serialized bytecode.
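
A minimal driver-side PySpark program, to make these roles concrete; the app name and data are illustrative.

# Minimal driver-side sketch. This Python code runs in the Driver process,
# which talks to the JVM through Py4j; the app name and data are illustrative.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("pyspark-demo")
sc = SparkContext(conf=conf)                    # Driver builds the DAG from RDD operations

data = sc.parallelize([1, 2, 3, 4])
print(data.map(lambda x: x * x).collect())      # Tasks are serialized and sent to the Executors
sc.stop()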

4. Spark Operation process

  • 1. The Application is first constructed by the Driver and decomposed into Stages.
  • 2. The Driver then applies for resources from the Cluster Manager.
  • 3. The Cluster Manager sends a call signal to some Worker Nodes.
  • 4. The called Worker Nodes start Executor processes and request tasks from the Driver.
  • 5. The Driver assigns tasks to the Worker Nodes.
  • 6. The Executors execute the tasks in units of Stages, while the Driver monitors them.
  • 7. The Driver receives the signal that the Executor tasks are complete and sends a logout signal to the Cluster Manager.
  • 8. The Cluster Manager sends a signal to the Worker Nodes to release resources.
  • 9. The Executors on the Worker Nodes stop running.

5. Spark deployment mode

  • Local: the local running mode; not distributed.
  • Standalone: the standalone cluster manager provided by Spark itself. After deployment, only Spark tasks can be run.
  • Yarn: the Hadoop cluster manager, which can run MapReduce, Spark, Storm, and HBase tasks at the same time.
  • Mesos: the biggest difference from Yarn is that Mesos uses two-level resource allocation: Mesos first offers resources to the computing framework, which can choose to accept or reject the offer (master URLs for these modes are sketched after this list).
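
A hedged sketch of selecting these modes via the master URL in a PySpark program; the host names and ports are placeholders.

# Illustrative master URLs for the deployment modes listed above;
# host names and ports are placeholders.
from pyspark import SparkConf, SparkContext

conf = (SparkConf().setAppName("deploy-demo")
        .setMaster("local[*]"))                   # Local mode (not distributed)
# .setMaster("spark://master-host:7077")          # Standalone cluster manager
# .setMaster("yarn")                              # Hadoop YARN
# .setMaster("mesos://mesos-host:5050")           # Apache Mesos

sc = SparkContext(conf=conf)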

6. RDD data structure

An RDD (Resilient Distributed Dataset) is a read-only, partitioned collection of records and is the basic data structure of Spark. An RDD represents an immutable, partitioned collection whose elements can be computed in parallel. There are generally two ways to create an RDD: the first is to read data from a file, and the second is to parallelize an object in memory.

# Method 1: read data from a file to generate an RDD.
rdd = sc.textFile("hdfs://hans/data_warehouse/test/data")
# Method 2: parallelize a collection in memory to generate an RDD.
rdd = sc.parallelize(arr)

After you create an RDD, you can operate on it with a variety of operations. There are two types of RDD operations: Transformations and Actions. A Transformation creates a new RDD from an existing RDD, while an Action performs a computation on the RDD and returns the result to the Driver. Transformations are lazy, which means Spark does not perform the actual calculation immediately; instead, it records the execution path. Only when an Action is triggered is the computation executed according to the DAG.
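
A small sketch of this lazy behavior; `sc` is an existing SparkContext and the data is illustrative.

# Sketch of lazy evaluation: the Transformations only record the lineage;
# nothing is computed until the Action at the end.
nums = sc.parallelize(range(10))
doubled = nums.map(lambda x: x * 2)               # Transformation: recorded, not executed
multiples = doubled.filter(lambda x: x % 4 == 0)  # Transformation: recorded, not executed

print(multiples.collect())                        # Action: triggers execution of the DAG
# [0, 4, 8, 12, 16]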

Transformation operations determine the dependencies between RDDs. There are two types of RDD dependencies: narrow and wide. With a narrow dependency, the partitions of the parent RDD map to the partitions of the child RDD one-to-one or many-to-one. With a wide dependency, the partitions of the parent RDD map to the partitions of the child RDD one-to-many or many-to-many. Operations that introduce wide dependencies typically involve a shuffle, which uses a Partitioner function to distribute records with different keys from each partition of the parent RDD to different partitions of the child RDD.
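
For illustration, a narrow-dependency operation and a wide-dependency operation side by side; `sc` is an existing SparkContext and the data is made up.

# Illustrative narrow vs. wide dependencies.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)

scaled = pairs.mapValues(lambda v: v * 10)        # narrow: each child partition depends on one parent partition
summed = pairs.reduceByKey(lambda a, b: a + b)    # wide: records are shuffled across partitions by key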

Dependencies determine how a DAG is cut into Stages. The cutting rule is: walking from back to front, cut a Stage whenever a wide dependency is encountered. The dependencies between RDDs form a directed acyclic graph (DAG), which is submitted to the DAGScheduler. The DAGScheduler divides the DAG into multiple interdependent Stages based on the wide and narrow dependencies between RDDs, creating a Stage boundary at every wide dependency. Each Stage contains one or more Tasks, which are then submitted to the TaskScheduler to run as TaskSets.
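
To see where the cut happens in practice, one can inspect an RDD's lineage. The sketch below does so with toDebugString; the file path is a placeholder and the exact output format varies by Spark version.

# Sketch: inspecting the lineage to see where a wide dependency (shuffle)
# will split the job into Stages. The file path is a placeholder.
lines = sc.textFile("/tmp/hello.txt")
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))   # wide dependency -> Stage boundary

lineage = counts.toDebugString()                   # lineage description (bytes in PySpark)
print(lineage.decode() if isinstance(lineage, bytes) else lineage)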

7. WordCount example

# 5 lines of code to complete the WordCount frequency statistics.
rdd_line = sc.textFile("/home/kesci/input/eat_pyspark9794/data/hello.txt")
rdd_word = rdd_line.flatMap(lambda x: x.split(" "))
rdd_one = rdd_word.map(lambda t: (t, 1))
rdd_count = rdd_one.reduceByKey(lambda x, y: x + y)
rdd_count.collect()

[('world', 1), ('love', 3), ('jupyter', 1), ('pandas', 1), ('hello', 2), ('spark', 4), ('sql', 1)]