This is the seventh day of my participation in the First Challenge 2022.


Hello everyone, I am ~

In the first hour, we will learn the following:

  • What is Spark?
  • How is Spark related to Hadoop?
  • What are the advantages of Spark?
  • What is Spark good for?
  • The core modules of Spark
  • System architecture of Spark

This section will help you understand some of Spark's core concepts in plain, simple language, ready for the applications to come.

What is Spark?

Let’s take a look at the official website.

What is Apache Spark?

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

To put it simply, Spark is a big data computing framework: a fast, memory-based engine for processing and computing over big data. It offers APIs in multiple languages, can be deployed on a single node or on a cluster, and ships with libraries for data analysis and machine learning.
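
To make that concrete, here is a minimal sketch in Scala (assuming the spark-sql dependency is on the classpath; the application name and the numbers are arbitrary illustrations) that starts Spark on a single node and runs a tiny computation:

    import org.apache.spark.sql.SparkSession

    object HelloSpark {
      def main(args: Array[String]): Unit = {
        // Start Spark on a single node, using all local CPU cores.
        val spark = SparkSession.builder()
          .appName("hello-spark")  // arbitrary, illustrative application name
          .master("local[*]")      // single-node deployment; a cluster URL would go here instead
          .getOrCreate()
        import spark.implicits._

        // A tiny in-memory dataset, just to show the engine doing work.
        val nums = Seq(1, 2, 3, 4, 5).toDS()
        println(nums.map(_ * 2).reduce(_ + _)) // prints 30

        spark.stop()
      }
    }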

How is Spark related to Hadoop?

Hadoop is a set of tools, or rather a big data ecosystem. Its core consists of HDFS, YARN, and MapReduce, which handle distributed file storage, resource scheduling, and computation respectively.

This was the early big data processing solution: YARN schedules the resources, and MapReduce reads the files stored in HDFS to do the computation.

MapReduce jobs have to be implemented in Java, which is tedious to write by hand, so Hive emerged as a SQL parser and optimizer for Hadoop: you write a SQL statement, Hive translates it into MapReduce jobs, and then executes them.

Is that perfect?

MapReduce simply processes pieces of data separately and then merges the results. Between steps it has to write and read files constantly, which makes it run very slowly. This is where Spark comes in: a memory-based alternative to MapReduce designed to improve speed.
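
As a rough illustration of the difference (a sketch, not a benchmark; the HDFS path is a placeholder), Spark lets you keep an intermediate dataset in memory and reuse it across several computations, exactly where MapReduce would go back to disk between jobs:

    import org.apache.spark.sql.SparkSession

    object CacheSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cache-sketch").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // "hdfs:///data/logs.txt" is a placeholder path, for illustration only.
        val lines  = sc.textFile("hdfs:///data/logs.txt")
        val errors = lines.filter(_.contains("ERROR")).cache() // keep in memory after first use

        // Both computations below reuse the in-memory data instead of re-reading from disk;
        // this is where the speedup over MapReduce's job-to-job disk writes comes from.
        println(errors.count())
        println(errors.filter(_.contains("timeout")).count())

        spark.stop()
      }
    }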

What are the advantages of Spark?

When a technology or framework becomes popular, it is bound to be solving some hard problems, and it has its own unique advantages. We should understand not only how to use it, but also why.

  • Fast

    As mentioned earlier, Spark is memory-based, so it is naturally faster than MapReduce, which reads and writes files frequently.

  • Easy to use

    As officially stated, Spark offers APIs in Java, Python, Scala, SQL, and R, along with over 80 high-level operators (see the word-count sketch after this list).

  • Full-featured

    Spark covers a full range of functionality, including batch processing, interactive queries, stream processing, and machine learning, and these capabilities come without sacrificing Spark's performance.

  • Easy integration

    Spark integrates easily with other open source products; for example, it can run on YARN and read data stored in HDFS.
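
Here is the classic example of that ease of use: word count written with a few of those high-level operators (flatMap, map, reduceByKey); the input path is a placeholder:

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("word-count").master("local[*]").getOrCreate()

        // "input.txt" is an illustrative placeholder for any text file.
        spark.sparkContext.textFile("input.txt")
          .flatMap(_.split("\\s+"))        // split lines into words
          .map(word => (word, 1))          // pair each word with a count of 1
          .reduceByKey(_ + _)              // sum the counts per word
          .collect()
          .foreach { case (word, n) => println(s"$word: $n") }

        spark.stop()
      }
    }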

What is Spark good for?

Spark is the king of data computing and processing, supporting PB-scale workloads in both real-time and offline scenarios. Typical uses include:

  • Stream processing of log files and sensor data (see the streaming sketch after this list)
  • Machine learning
  • Interactive analysis by data analysts
  • Data integration and cleaning between systems
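
As a taste of the first use case, here is a minimal Structured Streaming sketch; the socket source and port are illustrative stand-ins for a real log or sensor feed (in practice this would more likely be Kafka):

    import org.apache.spark.sql.SparkSession

    object StreamSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("stream-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // A socket source stands in here for a real log/sensor feed.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        // Count ERROR lines as they arrive.
        val errorCounts = lines.as[String]
          .filter(_.contains("ERROR"))
          .groupBy().count()

        // Print the running count to the console on each update.
        errorCounts.writeStream
          .outputMode("complete")
          .format("console")
          .start()
          .awaitTermination()
      }
    }

To try this locally, you can feed lines into the socket with a tool such as nc -lk 9999.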

The core modules of Spark

  • Spark Core

It provides Spark's most basic and core functionality, and is the foundation on which the modules below are built.

  • Spark SQL

A component for operating on structured data using SQL (see the sketch after this list).

  • Spark Streaming

A component that provides rich APIs for stream computation over real-time data.

  • Spark MLlib

Spark's machine learning algorithm library. Besides the algorithms themselves, it provides supporting functionality such as model evaluation and data import, as well as some lower-level machine learning primitives.

  • Spark GraphX

Spark's framework and algorithm library for graph computation.
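
As a small taste of the modules above, here is a minimal Spark SQL sketch (the table name and data are invented for illustration) showing plain SQL over structured data:

    import org.apache.spark.sql.SparkSession

    object SqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sql-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // A small in-memory table, just for illustration.
        val users = Seq(("Alice", 34), ("Bob", 28), ("Carol", 45)).toDF("name", "age")
        users.createOrReplaceTempView("users")

        // Query the structured data with ordinary SQL.
        spark.sql("SELECT name FROM users WHERE age > 30").show()

        spark.stop()
      }
    }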

System architecture of Spark

For this first day, we will only briefly introduce the system architecture.

We’ll talk about it in detail in the next few days, step by step.

The Spark architecture uses the master-slave model of distributed computing: a Master is a node in the cluster that runs the Master process, and a Slave is a node that runs a Worker process. The roles are as follows:

  • ClusterManager: In Standalone mode, this is the Master (master node), which manages the cluster and monitors the Workers. In YARN mode, it is the ResourceManager.

  • Worker: A slave node that manages its compute node and starts the Executor or the Driver. In YARN mode, this role is played by the NodeManager.

  • Driver: Runs the Application's main() function and creates the SparkContext.

  • Executor: A component on a Worker node that executes tasks; it starts a thread pool to run them. Each Application has its own independent set of Executors.

  • SparkContext: The entire application context, which controls the application life cycle.
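
To connect these roles to code: the Driver chooses its ClusterManager through the master URL it passes when creating the SparkContext. A minimal sketch (the host name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    object ArchSketch {
      def main(args: Array[String]): Unit = {
        // The master URL tells the Driver which ClusterManager to talk to:
        //   "local[*]"            -- run everything in this single JVM (no cluster)
        //   "spark://master:7077" -- Standalone mode; "master" is a placeholder host
        //   "yarn"                -- let YARN's ResourceManager allocate Executors
        val conf = new SparkConf()
          .setAppName("arch-sketch")
          .setMaster("local[*]")

        // Creating the SparkContext is what makes this process the Driver:
        // it registers with the ClusterManager, which asks Workers to start Executors.
        val sc = new SparkContext(conf)

        println(sc.parallelize(1 to 100).sum()) // the tasks run on the Executors

        sc.stop()
      }
    }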

Finally

Happy times are always short, and the hour has flown by. Has the first day's knowledge sunk in? It doesn't matter if it still feels hazy. Tomorrow, I will take you onto the battlefield to get hands-on with Spark.