1. Spark Core provides the basic and core functionality of Spark, including the following:

(1) SparkContext:

Typically, a Driver Application's execution and output go through the SparkContext. Before formally submitting an application, the SparkContext must first be initialized. SparkContext hides network communication, distributed deployment, message passing, storage, computation, caching, the metrics system, file services, and web services; application developers only need to use the API provided by SparkContext to build their functionality. The DAGScheduler built into SparkContext is responsible for creating jobs, dividing the RDDs of a DAG into different stages, and submitting those stages. The built-in TaskScheduler is responsible for requesting resources from the cluster, submitting tasks, and scheduling tasks.
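As a minimal sketch (the application name and the local master are illustrative, not taken from the text above), initializing a SparkContext and driving a job through its API might look like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample {
  def main(args: Array[String]): Unit = {
    // Configure the application; "local[*]" is just a placeholder master for local testing.
    val conf = new SparkConf()
      .setAppName("SparkContextExample")
      .setMaster("local[*]")

    // Initializing the SparkContext brings up the DAGScheduler, TaskScheduler,
    // storage, metrics, and the other services described above.
    val sc = new SparkContext(conf)

    // A trivial job: the SparkContext API is all the developer needs to touch.
    val sum = sc.parallelize(1 to 100).reduce(_ + _)
    println(s"sum = $sum")

    sc.stop()
  }
}
```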

(2) Storage system:

Storage preferentially uses the memory of each node; disks are used only when memory is insufficient. This greatly reduces disk I/O and improves task execution efficiency, making Spark suitable for real-time and streaming computation. In addition, Spark offers a memory-centric, highly fault-tolerant distributed file system called Tachyon for users who need it; Tachyon provides reliable memory-level file sharing services for Spark.
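A hedged sketch of this memory-first behaviour using the standard persist API (the input path is hypothetical, and the SparkContext `sc` from the earlier sketch is assumed):

```scala
import org.apache.spark.storage.StorageLevel

// Assuming an existing SparkContext `sc`; the path below is a placeholder.
val lines = sc.textFile("hdfs:///data/input.txt")

// MEMORY_AND_DISK keeps partitions in memory and spills to disk only when
// memory is insufficient, matching the behaviour described above.
val cached = lines.map(_.toUpperCase).persist(StorageLevel.MEMORY_AND_DISK)

println(cached.count())   // the first action materializes and caches the RDD
println(cached.count())   // later actions reuse the cached partitions
```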

(3) Computing Engine:

The computing engine consists of the DAGScheduler in SparkContext, RDDs, and the Map and Reduce tasks executed by executors on specific nodes. Although the DAGScheduler and RDDs live inside SparkContext, before a job's tasks are formally submitted and executed, its RDDs are organized into a directed acyclic graph (DAG) and divided into stages, which determines the number of tasks in each stage as well as the iterative computation, shuffle, and other processes that take place during task execution.
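A small sketch of how a shuffle operation splits a job into stages; the input path is hypothetical and the SparkContext `sc` is assumed to exist:

```scala
// Assuming an existing SparkContext `sc`.
val words = sc.textFile("hdfs:///data/words.txt")   // placeholder input path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

// reduceByKey introduces a shuffle, so the DAGScheduler cuts the job into
// two stages at this boundary once an action is invoked.
val counts = words.reduceByKey(_ + _)

// toDebugString prints the RDD lineage, i.e. the DAG the stages are cut from.
println(counts.toDebugString)
counts.collect()
```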

(4) Deployment mode:

Because a single node cannot provide sufficient storage and computing capacity for big data processing, Spark supports the Standalone deployment mode as well as distributed resource management systems such as YARN and Mesos in the TaskScheduler component of SparkContext. The Standalone, YARN, and Mesos deployment modes all allocate computing resources to tasks, improving the efficiency of concurrent task execution.
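For illustration only, the deployment mode is selected through the master URL passed to SparkConf; the hosts and ports below are placeholders:

```scala
import org.apache.spark.SparkConf

// The master URL selects the deployment mode; hosts and ports are made up.
val standaloneConf = new SparkConf().setMaster("spark://master-host:7077") // Standalone
val yarnConf       = new SparkConf().setMaster("yarn")                     // YARN
val mesosConf      = new SparkConf().setMaster("mesos://mesos-host:5050")  // Mesos
```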

2. Spark’s main sub-frameworks include:

(1) Spark SQL:

SqlParser is used to convert SQL into a syntax tree, and RuleExecutor is used to apply a series of rules to that syntax tree; finally, a physical execution plan is generated and executed. The rule executors include the Analyzer and the Optimizer.
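A rough sketch of how this pipeline is exercised from user code (the table, schema, and SQLContext usage here are illustrative assumptions); explain(true) prints the parsed, analyzed, and optimized logical plans along with the physical plan produced by these rule executors:

```scala
import org.apache.spark.sql.SQLContext

// Assuming an existing SparkContext `sc`; names and data are made up.
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25))).toDF()
people.registerTempTable("people")

val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")

// Prints the parsed, analyzed, and optimized logical plans and the physical plan.
adults.explain(true)
adults.show()
```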

(2) Spark Streaming:

Used for streaming computation. Spark Streaming supports multiple data input sources, such as Kafka, Flume, Twitter, MQTT, ZeroMQ, Kinesis, and plain TCP sockets. The input stream receiver (Receiver) is the interface specification for receiving data streams. DStream is Spark Streaming's abstraction over all data streams, and DStreams can be organized into a DStream Graph. A DStream is essentially a sequence of consecutive RDDs.
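A minimal sketch of a receiver-based DStream over a plain TCP socket; the host, port, and batch interval are arbitrary choices:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Host, port, and the 5-second batch interval are placeholders.
val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// socketTextStream creates a receiver-based input DStream; each batch of the
// DStream is materialized as one RDD in the underlying RDD sequence.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```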

(3) GraphX:

GraphX is Spark's distributed graph computation framework. GraphX mainly follows the Pregel model of the Bulk Synchronous Parallel (BSP) computing paradigm. GraphX provides the Graph abstraction, which consists of Vertex, Edge, and EdgeTriplet (which extends Edge). GraphX currently ships implementations of shortest paths, PageRank, connected components, triangle counting, and other algorithms for users to choose from.
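A small, made-up graph illustrating the Graph abstraction and one of the bundled algorithms (PageRank); the SparkContext `sc` is assumed to exist:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Assuming an existing SparkContext `sc`; the tiny graph below is invented.
val vertices = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol")
))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
))

// Graph is built from a vertex RDD and an edge RDD; triplets pair each edge
// with its source and destination vertex attributes (EdgeTriplet).
val graph = Graph(vertices, edges)
graph.triplets.collect().foreach(println)

// One of the bundled algorithms: PageRank with a convergence tolerance of 0.001.
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)
```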

(4) MLlib:

The machine learning framework provided by Spark. Machine learning is an interdisciplinary field spanning probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other areas. MLlib currently provides many statistical, probabilistic, and data mining algorithms, including basic statistics, classification, regression, decision trees, random forests, naive Bayes, isotonic regression, collaborative filtering, clustering, dimensionality reduction, feature extraction and transformation, frequent pattern mining, Predictive Model Markup Language (PMML) support, pipelines, and more.
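A brief sketch of the basic statistics portion of MLlib (the sample vectors are invented, and the SparkContext `sc` is assumed):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Assuming an existing SparkContext `sc`; the sample vectors are made up.
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)
))

// colStats computes column-wise summary statistics (mean, variance, etc.).
val summary = Statistics.colStats(observations)
println(summary.mean)
println(summary.variance)
println(summary.numNonzeros)
```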