On November 8th, 2019, Xingbo Jiang sent an email to the community announcing the official release of the Apache Spark 3.0 preview. This release exists mainly for large-scale community testing of the upcoming Apache Spark 3.0 release. The preview is not a stable release, either in terms of API or functionality; its main purpose is to give the community an early taste of the new features in Apache Spark 3.0. If you want to test this version, you can download it here.

Apache Spark 3.0 adds many exciting new features, including Dynamic Partition Pruning, Adaptive Query Execution, Accelerator-aware Scheduling, a Data Source API with Catalog Support, vectorization in SparkR, support for Hadoop 3/JDK 11/Scala 2.12, and more. For a complete list of the major features and changes in Spark 3.0.0-preview, see here. Below I'll walk you through some of the more important new features.

PS: If you look closely, Spark 3.0 doesn't seem to have many issues related to Streaming/Structured Streaming. There may be several reasons:

  • At present, Spark Streaming/Structured Streaming based on batch mode can meet most enterprise requirements, and few applications truly need real-time computing, so the Continuous Processing module is still in the experimental stage and there is no hurry to graduate it;
  • Databricks is presumably investing heavily in Delta Lake related development, which brings in revenue for the business and is currently their main focus, so naturally less effort goes into Streaming development.

Without further ado, let's take a look at the new features of Spark 3.0.

Dynamic Partition Pruning

So-called dynamic partition pruning performs further partition pruning based on information inferred at run time. For example, consider the following query:

    SELECT *
    FROM dim_iteblog
    JOIN fact_iteblog ON (dim_iteblog.partcol = fact_iteblog.partcol)
    WHERE dim_iteblog.othercol > 10

Assume that the filter dim_iteblog.othercol > 10 on the dim_iteblog table leaves relatively little data, but because previous versions of Spark could not evaluate this cost dynamically, the fact_iteblog table might scan a large amount of useless data. With dynamic partition pruning, the useless partitions of the fact_iteblog table can be filtered out at run time. After this optimization, the amount of data scanned by the query is greatly reduced and performance improves by 33 times.
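
Below is a minimal sketch (not taken from the article or the benchmark) of how this can be exercised from Scala. The table and column names are the hypothetical ones from the query above, and the configuration key is the one used by the Spark 3.0 code base, where it is enabled by default, so setting it explicitly is purely illustrative.

    import org.apache.spark.sql.SparkSession

    object DynamicPartitionPruningExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("dpp-example")
          .getOrCreate()

        // Dynamic partition pruning is on by default in Spark 3.0.
        spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

        // The selective filter on the dimension table is reused at run time to build
        // a pruning predicate that skips partitions of the partitioned fact table.
        val df = spark.sql(
          """
            |SELECT *
            |FROM dim_iteblog
            |JOIN fact_iteblog ON (dim_iteblog.partcol = fact_iteblog.partcol)
            |WHERE dim_iteblog.othercol > 10
          """.stripMargin)

        // The physical plan should contain a dynamic pruning subquery on partcol.
        df.explain(true)
      }
    }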

For the corresponding issues of this feature, see SPARK-11150 and SPARK-28888. The Past Memory Big Data public account also introduced this feature in detail a few days ago; for details, see Apache Spark 3.0 Dynamic Partition Pruning Introduction and Using Apache Spark 3.0 Dynamic Partition Pruning.

Adaptive Query Execution

Adaptive Query Execution (also known as Adaptive Query Optimisation or Adaptive Optimisation) is an optimization of query execution plans that allows the Spark planner to choose among alternative execution plans at runtime; these plans are optimized based on runtime statistics.

As early as 2015, the Spark community proposed the basic idea of adaptive execution, adding an interface for submitting a single map stage to Spark's DAGScheduler and attempting to adjust the number of shuffle partitions at runtime. However, this implementation had some limitations. In some scenarios it introduced more shuffles, that is, more stages, and it could not handle well the case where three tables are joined in the same stage. Within that framework it was also difficult to flexibly implement other adaptive features, such as changing the execution plan or handling skewed joins at run time. So the feature remained experimental, and its configuration parameters are not mentioned in the official documentation. The idea came from Intel and Baidu; for details, see SPARK-9850 and Apache Spark SQL Adaptive Execution Practices.

The Adaptive Query Execution in Apache Spark 3.0 is based on SPARK-9850; for details, see SPARK-23128. The goal of SPARK-23128 is to implement a flexible framework for adaptive execution in Spark SQL and to support changing the number of reducers at run time. The new implementation addresses all of the limitations discussed above, and other features, such as changing join strategies and handling skewed joins, will be implemented as separate features and provided as plug-ins in later versions.
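
As a rough sketch of how this is switched on (assuming the configuration keys of the Spark 3.0 code base; the 3.0.0-preview build may use slightly different names), adaptive execution is controlled by a handful of session configs:

    import org.apache.spark.sql.SparkSession

    // Adaptive Query Execution re-optimizes the remaining plan at shuffle stage
    // boundaries using the statistics collected at run time.
    val spark = SparkSession.builder()
      .appName("aqe-example")
      // The master switch; AQE is off by default in Spark 3.0.
      .config("spark.sql.adaptive.enabled", "true")
      // Coalesce small shuffle partitions at run time instead of relying on a
      // fixed spark.sql.shuffle.partitions value.
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
      .getOrCreate()

When AQE kicks in, the physical plan printed by df.explain() is wrapped in an AdaptiveSparkPlan node, which makes it easy to verify that a query actually went through the adaptive path.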

Accelerator-aware Scheduling

Nowadays, big data and machine learning have been combined to a large extent. In machine learning, because compute iterations may take a very long time, developers generally choose GPUs, FPGAs, or TPUs to accelerate the computation. Native GPU and FPGA support has already been built into Apache Hadoop 3.1, and Spark, as a general-purpose computing engine, is certainly not far behind. Engineers from Databricks, NVIDIA, Google, and Alibaba are adding native GPU scheduling support to Apache Spark. This solution fills the gap in Spark's GPU resource task scheduling, organically integrates big data processing and AI applications, and expands Spark's application scenarios in deep learning, signal processing, and various big data applications. The issues for this work can be found in SPARK-24615, and the relevant SPIP (Spark Project Improvement Proposals) document can be found in SPIP: Accelerator-Aware Scheduling.

Apache Spark 3.0 provides built-in GPU scheduling support

YARN and Kubernetes, resource managers supported by Apache Spark, already support GPUs. For Spark itself to support GPUs, two major technical changes need to be made:

  • At the cluster manager level, cluster managers need to be upgraded to support GPUs, and APIs need to be provided for users to control the use and allocation of GPU resources.
  • Within Spark, changes need to be made at the scheduler level so that the scheduler can recognize GPU requirements in user task requests and then allocate GPUs based on the supply on the executors.

Since getting Apache Spark to support GPUs is a large feature, the project has been divided into several phases. In Apache Spark 3.0, GPUs are supported on the standalone, YARN, and Kubernetes resource managers, with little impact on existing normal jobs. TPU support, GPU support on the Mesos resource manager, and GPU support on the Windows platform are not goals of this release. Also, fine-grained scheduling within a single GPU card is not supported in this version; Apache Spark 3.0 treats a GPU card and its memory as an inseparable unit. For details, see Apache Spark 3.0 will provide Built-in GPU scheduling support.
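
The sketch below shows roughly what the new resource configuration looks like in Scala (the resource amounts and the discovery-script path are hypothetical examples, not values from the article; in practice these configs are usually passed via spark-submit):

    import org.apache.spark.TaskContext
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("gpu-scheduling-example")
      // Each executor exposes one GPU, discovered by a script that prints the
      // GPU addresses in the JSON format Spark expects (Spark ships a sample
      // getGpusResources.sh for NVIDIA GPUs).
      .config("spark.executor.resource.gpu.amount", "1")
      .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpusResources.sh")
      // Each task requests one whole GPU, consistent with the card-level
      // (not fine-grained) scheduling described above.
      .config("spark.task.resource.gpu.amount", "1")
      .getOrCreate()

    // Inside a task, the assigned GPU addresses can be looked up and handed to
    // the actual GPU library (TensorFlow, RAPIDS, and so on).
    spark.sparkContext.parallelize(1 to 4, 4).map { _ =>
      val gpus = TaskContext.get().resources()("gpu").addresses
      s"running on GPU(s): ${gpus.mkString(",")}"
    }.collect().foreach(println)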

Apache Spark DataSource V2

The Data Source API defines how to read and write data from storage systems, for example Hadoop's InputFormat/OutputFormat and Hive's SerDe. These APIs are well suited to RDD programming in Spark. Programming against them solves the problem, but the cost to users is high, and Spark cannot optimize such reads. To solve these problems, Spark 1.3 introduced the Data Source API V1, through which we can easily read data from various sources while Spark applies some of the SQL component's optimizations to the data source reads, such as column pruning and filter push-down.

The Data Source API V1 abstracts a series of interfaces for us that can be used to implement most scenarios. However, as the number of users grew, some problems gradually emerged:

  • Some interfaces rely on SQLContext and DataFrame
  • Extensibility is limited; it is difficult to push down other operators
  • Lack of support for columnar reads
  • Lack of partitioning and sorting information
  • Write operations do not support transactions
  • Stream processing is not supported

In order to solve some of the problems with Data Source API V1, starting with Apache Spark 2.3.0 the community introduced the Data Source API V2. In addition to retaining the original functionality, it also addresses some of the problems of Data Source API V1, such as no longer depending on upper-level APIs and better extensibility. For the Data Source API V2 issue, see SPARK-15689. This feature has been available since the Spark 2.x releases, but it was not very stable, so the community has two issues about the stability and new features of the Spark DataSource API V2: SPARK-25186 and SPARK-22386. The final stable release of the Spark DataSource API V2, together with new features, will arrive later this year along with Apache Spark 3.0.0; it is a major new feature of the Spark 3.0.0 release.
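
To give a feel for the shape of the V2 API, here is a minimal, read-only source sketched against the Spark 3.0 connector interfaces (package org.apache.spark.sql.connector). All of the class names (SimpleSource, SimpleTable, and so on) and the hard-coded two-column schema are made up for illustration, and the interfaces changed between the 2.x, 3.0-preview, and 3.0 final releases, so treat this as a sketch rather than a definitive template:

    import java.util

    import scala.collection.JavaConverters._

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.connector.read._
    import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
    import org.apache.spark.sql.util.CaseInsensitiveStringMap
    import org.apache.spark.unsafe.types.UTF8String

    // Entry point looked up by spark.read.format("<fully qualified class name>").
    class SimpleSource extends TableProvider {
      override def inferSchema(options: CaseInsensitiveStringMap): StructType = SimpleTable.schema
      override def getTable(schema: StructType, partitioning: Array[Transform],
                            properties: util.Map[String, String]): Table = new SimpleTable
    }

    object SimpleTable {
      val schema: StructType = new StructType().add("id", IntegerType).add("name", StringType)
    }

    // A logical table that only declares the batch-read capability.
    class SimpleTable extends SupportsRead {
      override def name(): String = "simple_table"
      override def schema(): StructType = SimpleTable.schema
      override def capabilities(): util.Set[TableCapability] = Set(TableCapability.BATCH_READ).asJava
      override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = new SimpleScanBuilder
    }

    // For brevity one class plays the ScanBuilder, Scan and Batch roles.
    class SimpleScanBuilder extends ScanBuilder with Scan with Batch {
      override def build(): Scan = this
      override def readSchema(): StructType = SimpleTable.schema
      override def toBatch(): Batch = this
      // A single partition holding five hard-coded rows.
      override def planInputPartitions(): Array[InputPartition] = Array(new SimplePartition)
      override def createReaderFactory(): PartitionReaderFactory = new SimpleReaderFactory
    }

    class SimplePartition extends InputPartition

    class SimpleReaderFactory extends PartitionReaderFactory {
      override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
        new PartitionReader[InternalRow] {
          private var i = 0
          override def next(): Boolean = { i += 1; i <= 5 }
          override def get(): InternalRow = InternalRow(i, UTF8String.fromString(s"name_$i"))
          override def close(): Unit = {}
        }
    }

The split into TableProvider/Table/ScanBuilder/Batch/PartitionReader is what gives the engine room for operator push-down, columnar reads, transactional writes, and streaming sources, i.e. the V1 shortcomings listed above; a source written like this is then read with spark.read.format(...).load().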

For more details about the Apache Spark DataSource V2, see the two articles Apache Spark DataSource V2 Introduction and Getting Started Programming Guide (Part 1) and (Part 2) from the Past Memory Big Data public account.

Better ANSI SQL compatibility

PostgreSQL is one of the most advanced open source databases. It supports most of the major features of SQL:2011; of the 179 features required for full SQL:2011 compliance, PostgreSQL meets at least 160. The Spark community has an umbrella issue, SPARK-27764, to address the differences between Spark SQL and PostgreSQL, including feature completion and bug fixes. Feature enhancements include ANSI SQL functions, distinguishing SQL reserved keywords, and built-in functions. This issue corresponds to 231 sub-issues; if all of them are resolved, the gap between Spark SQL and PostgreSQL or ANSI SQL:2011 will be even smaller.
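
As a hedged example (in the final Spark 3.0 release the switch is spark.sql.ansi.enabled; the preview builds may expose it under a different name), ANSI mode changes several runtime behaviors, for instance making invalid casts and arithmetic overflow raise errors instead of silently returning NULL or a wrapped value:

    // Assumes an existing SparkSession named spark.
    spark.conf.set("spark.sql.ansi.enabled", "true")

    // Under ANSI mode this should throw an overflow error at execution time;
    // without it the result silently wraps around to -2147483648.
    spark.sql("SELECT CAST(2147483647 AS INT) + 1").show()

    // Under ANSI mode this cast should fail; without it the result is NULL.
    spark.sql("SELECT CAST('iteblog' AS INT)").show()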

SparkR vectorized read and write

Spark has supported R since version 1.4. The architecture of the interaction between Spark and R at that time is shown below:

Whenever we use R to interact with the Spark cluster, the calls need to go through the JVM, so data serialization and deserialization cannot be avoided. This performs very poorly when the amount of data is large!

Additionally, Apache Spark has already implemented vectorized optimizations in many places, such as the internal columnar format, vectorized Parquet/ORC reading, Pandas UDFs, and more. Vectorization can greatly improve performance. SparkR vectorization allows users to keep their existing code as-is while improving performance by approximately several thousand times when executing R native functions or converting between a Spark DataFrame and an R DataFrame. The work can be seen in SPARK-26759. The new architecture is as follows:

As you can see, SparkR vectorization makes use of Apache Arrow, which makes data interchange between the two systems very efficient and avoids the cost of data serialization and deserialization, so the performance of the interaction between SparkR and Spark is greatly improved with this approach.