1. Introduction

Spark was born in 2009 at the AMPLab at the University of California, Berkeley. It was donated to the Apache Software Foundation in 2013 and became an Apache top-level project in February 2014. Compared with MapReduce's batch computing, Spark can deliver up to two orders of magnitude better performance, and it has become one of the most widely used distributed computing frameworks after MapReduce.

2. Features

Apache Spark has the following features:

  • A state-of-the-art DAG scheduler, query optimizer, and physical execution engine deliver high performance;
  • Multi-language support: APIs are currently available in Java, Scala, Python, and R;
  • More than 80 high-level operators make it easy to build applications (see the word-count sketch after this list);
  • Supports batch processing, stream processing, and complex analytics;
  • Rich library support, including SQL (Spark SQL), MLlib, GraphX, and Spark Streaming, all of which can be combined seamlessly;
  • Flexible deployment: runs in local mode or in its own standalone cluster mode, and also on Hadoop YARN, Mesos, and Kubernetes;
  • Multi-data-source support: can access HDFS, Alluxio, Cassandra, HBase, Hive, and hundreds of other data sources.
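As a small taste of those high-level operators, here is a minimal Scala word-count sketch. It is illustrative only: the local[*] master and the input.txt path are assumptions, not part of the original article.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local mode for demonstration; on a cluster the master is usually set by spark-submit.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Three high-level operators (flatMap, map, reduceByKey) express the whole job.
    sc.textFile("input.txt")              // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    spark.stop()
  }
}
```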

3. Cluster Architecture

Key terms:

  • Application: a user program built on Spark, consisting of one Driver and multiple Executors on the cluster;
  • Driver program: the main process of the application; it runs the application's main() method and creates the SparkContext;
  • Cluster manager: the cluster resource manager (for example, the Standalone Manager, Mesos, or YARN);
  • Worker node: a node in the cluster that performs computing tasks;
  • Executor: a process launched for the application on a worker node; it runs tasks and keeps output data in memory or on disk;
  • Task: the unit of work sent to an Executor.

Execution process:

  1. After the user program creates the SparkContext, it connects to the cluster resource manager, which allocates computing resources to the program and starts the Executors (a sketch of this step follows the list).
  2. The Driver divides the application into stages and tasks, which are then distributed to the Executors.
  3. The Executors run the tasks and report their status to the Driver; each Executor also reports the resource usage of its node to the cluster resource manager.
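To make step 1 concrete, here is a hedged Scala sketch: the standalone master URL and the resource settings are placeholders, but they show how creating the SparkContext connects the Driver to a cluster manager and tells it what to allocate for the Executors.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Creating the SparkContext connects the Driver to the cluster manager
// named by the master URL (spark://... is a hypothetical standalone address).
val conf = new SparkConf()
  .setAppName("ArchitectureDemo")
  .setMaster("spark://master-host:7077")
  .set("spark.executor.memory", "2g")   // resources the manager allocates per Executor
  .set("spark.executor.cores", "2")
val sc = new SparkContext(conf)

// Steps 2 and 3 happen when an action runs: the Driver splits the job into
// stages and tasks, ships them to the Executors, and gathers their results.
val sum = sc.parallelize(1 to 1000).reduce(_ + _)
println(sum)

sc.stop()
```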

4. Core Components

On top of Spark Core, Spark provides four core components that address computing needs in different domains.

4.1 Spark SQL

Spark SQL is used to process structured data. It has the following characteristics:

  • Seamlessly blends SQL queries with Spark programs, allowing you to query structured data using either SQL or the DataFrame API;
  • Supports multiple data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC;
  • Supports HiveQL syntax and user-defined functions (UDFs), allowing you to access existing Hive warehouses;
  • Supports standard JDBC and ODBC connections;
  • Includes an optimizer, columnar storage, and code generation to improve query performance (a short example follows this list).
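The sketch below shows the two query styles side by side. The people.json file and its name/age fields are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSqlDemo")
  .master("local[*]")
  .getOrCreate()

// Read structured data (hypothetical JSON file with "name" and "age" fields).
val people = spark.read.json("people.json")

// The same query expressed through the DataFrame API...
people.filter(people("age") > 21).select("name").show()

// ...and through plain SQL against a temporary view.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()

spark.stop()
```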

4.2 Spark Streaming

Spark Streaming is used to quickly build scalable, high-throughput, fault-tolerant stream-processing applications. It supports reading and processing data from sources such as HDFS, Flume, Kafka, Twitter, and ZeroMQ.

At its core, Spark Streaming is micro-batch processing: it splits the data stream into a series of small batches, achieving an effect close to true stream processing. The sketch below shows the idea.
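A minimal Scala sketch, assuming a text socket on localhost:9999 as the source (for example, fed by nc -lk 9999); each 5-second window becomes one micro-batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one thread for the receiver, one for processing.
val conf = new SparkConf().setAppName("StreamingDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

// Hypothetical source: a text socket on localhost:9999.
val lines = ssc.socketTextStream("localhost", 9999)

// The familiar batch operators now run once per micro-batch.
lines.flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until the job is stopped
```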

4.3 MLlib

MLlib is Spark’s machine learning library. It is designed to make machine learning simple and scalable. It provides the following tools:

  • Common machine learning algorithms, such as classification, regression, clustering, and collaborative filtering;
  • Featurization: feature extraction, transformation, dimensionality reduction, and selection;
  • Pipelines: tools for constructing, evaluating, and tuning ML pipelines (a sketch follows this list);
  • Persistence: saving and loading algorithms, models, and pipelines;
  • Utilities: linear algebra, statistics, data handling, and more.
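The following Scala sketch chains featurization and an algorithm into a Pipeline and then persists the fitted model; the toy training data and the save path are assumptions.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MLlibPipelineDemo")
  .master("local[*]")
  .getOrCreate()

// Toy training set: text documents with binary labels (hypothetical data).
val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop mapreduce", 0.0),
  (2L, "spark streaming and mllib", 1.0),
  (3L, "plain old batch jobs", 0.0)
)).toDF("id", "text", "label")

// Featurization stages plus a classifier, chained into a single Pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fit the whole pipeline, then save it (the persistence tooling noted above).
val model = pipeline.fit(training)
model.write.overwrite().save("/tmp/demo-lr-model")   // placeholder path

spark.stop()
```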

4.4 GraphX

GraphX is Spark's component for graphs and graph-parallel computation. At a high level, GraphX extends the RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX provides a set of fundamental operators (such as subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks. A small example follows.
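A minimal Scala sketch of the property-graph abstraction, using one of the built-in algorithms; the vertices and edges are made-up data.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("GraphXDemo")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// A tiny property graph: vertices carry names, edges carry relationship labels.
val vertices = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol"), (4L, "Dave")
))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(4L, 1L, "follows")
))
val graph = Graph(vertices, edges)

// connectedComponents is one of the built-in graph algorithms mentioned above.
graph.connectedComponents().vertices.collect().foreach {
  case (id, component) => println(s"vertex $id belongs to component $component")
}

spark.stop()
```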

For more articles in this big data series, see the GitHub open-source project: Getting Started with Big Data.