This article was originally published by AI Frontier.
Spark 2.3 blockbuster release: continuous stream processing to compete with Flink


Planning Editor | Natalie


Author | Sameer Agarwal, Xiao Li, Reynold Xin, Jules Damji


Translator | Xue Mingdeng

On February 28, 2018, Databricks announced Apache Spark 2.3.0 on its official engineering blog and made it available as part of the Databricks Runtime 4.0 beta. The new version introduces a continuous processing mode that reduces streaming latency to the millisecond level, a feature widely billed as a head-to-head challenge to Flink. What other important updates are there, and does Spark deserve a new level-up? Read on to find out.

Spark 2.3 continues to push Structured Streaming toward being faster, easier to use, and smarter: it introduces low-latency continuous processing and stream-to-stream joins, improves PySpark performance with Pandas UDFs, and adds native Kubernetes support for Spark applications.

In addition to continuing to introduce new features in SparkR, Python, MLlib, and GraphX, this release focuses on usability and stability, resolving over 1,400 tickets. Other major features are as follows:

  • DataSource V2 API
  • Vectorized ORC reader
  • Spark History Server V2, backed by a key-value store
  • Machine learning pipeline API based on Structured Streaming
  • MLlib enhancements
  • Spark SQL enhancements

The major features and improvements are briefly summarized below; for more information, see the Spark 2.3 release notes (spark.apache.org/releases/sp…).


Continuous streaming processing at the millisecond level

Structured Streaming, introduced in Spark 2.0, decouples micro-batch processing from its high-level APIs, for two reasons. First, it simplifies the use of the API: developers no longer need to reason about micro-batches. Second, it lets developers treat a stream as an unbounded table and run queries against these "tables."

To give developers a truer streaming experience, Spark 2.3 introduces a continuous processing mode with millisecond-level latency.

Internally, the Structured Streaming engine has so far executed queries as a series of micro-batches triggered at periodic intervals; the resulting latency is acceptable for most real-world streaming applications.

In continuous mode, the stream processor continuously pulls and processes data from the source instead of reading a batch at fixed intervals, so newly arrived data is processed promptly. As shown in the figure below, latency drops to the millisecond level, fully satisfying low-latency requirements.

The Dataset operations currently supported in continuous mode include projections, selections, and SQL functions other than current_timestamp(), current_date(), and aggregate functions. Kafka is supported as both a data source and a sink, and the console and memory sinks are also supported.

Structured Streaming lets developers choose between continuous mode and micro-batch mode based on their actual latency requirements; in either mode, it provides fault-tolerance and reliability guarantees.

In short, Spark 2.3's continuous mode offers:

  • End-to-end millisecond-level latency
  • At-least-once processing guarantees
  • Support for map-like operations on Datasets


Stream-to-stream joins

Structured Streaming in Spark 2.0 already supported DataFrame/Dataset joins, but only between a stream and a static dataset. Spark 2.3 brings the long-awaited stream-to-stream joins, covering both inner and outer joins, which opens up a large number of real-time scenarios.

Ad monetization is a typical application scenario for stream-to-stream joins. For example, an ad impression stream and a user click stream share a common key (such as adId) and related data, and you want to run streaming analytics over them to find out which user clicks correspond to which impressions.

While seemingly simple, stream-to-stream joins solve several technical challenges under the hood:

  • Late data is buffered until a match is found in the other stream.
  • Watermarks keep the buffer from growing without bound.
  • Users can trade off resource consumption against latency.
  • The SQL join syntax is consistent between static and streaming joins.


Spark and Kubernetes

Combining the capabilities of these two open-source projects promises large-scale distributed data processing with better orchestration. In Spark 2.3, users can run Spark natively on a Kubernetes cluster to make better use of resources, and different workloads can share the same Kubernetes cluster.

Spark can use all the management features of Kubernetes, such as resource quotas, pluggable authorization, and logging. In addition, starting the Spark workload on an existing Kubernetes cluster is as simple as creating a Docker image.
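As a sketch of what submission looks like (following the pattern documented for Spark 2.3's Kubernetes scheduler; the API-server address, image name, and jar path below are placeholders, not real endpoints):

```shell
$ bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<your-spark-image> \
    local:///opt/spark/examples/jars/spark-examples.jar
```

Kubernetes then launches the driver and executors as pods built from the given container image.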


Pandas UDFs for PySpark

Pandas UDFs, also known as vectorized UDFs, bring significant performance improvements to PySpark. Built on Apache Arrow and developed entirely in Python, they can be used to define low-overhead, high-performance UDFs.

Spark 2.3 provides two types of Pandas UDFs: scalar and grouped map. An earlier blog post (databricks.com/blog/2017/1…) by Li Jin of Two Sigma walks through four examples of using Pandas UDFs.
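As a minimal sketch of the scalar flavor (the function name and column are made up for illustration), the key point is that the function body receives and returns a whole pandas.Series, which is what makes it vectorized:

```python
import pandas as pd

def cubed(v):
    # Receives an entire pandas.Series at once and returns one,
    # instead of being invoked once per row.
    return v ** 3

# In Spark 2.3 this would be registered as a scalar Pandas UDF
# (requires PyArrow) and applied to a DataFrame column:
#   from pyspark.sql.functions import pandas_udf, PandasUDFType
#   cubed_udf = pandas_udf(cubed, "long", PandasUDFType.SCALAR)
#   df.select(cubed_udf(df["x"]))
```

Because the body is a plain Series-to-Series function, it can be unit-tested with pandas alone, without a Spark session.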

Some benchmarks show that Pandas UDFs outperform row-at-a-time UDFs by an order of magnitude.

Some contributors, including Li Jin, plan to introduce aggregation and windowing capabilities to Pandas UDFs.


Improvements in MLlib

Spark 2.3 brings a number of MLlib improvements spanning algorithms, features, performance, scalability, and usability.

First, MLlib models and pipelines can now be deployed to production through a Structured Streaming job, although some existing pipelines may need to be modified.

Second, to meet the needs of deep-learning-based image analysis, Spark 2.3 introduces ImageSchema, which represents images in a Spark DataFrame and provides utilities for loading common image formats.

Finally, Spark 2.3 brings improved Python APIs for developing custom algorithms, including UnaryTransformer, which simplifies writing custom single-column feature transformers, and automated tooling for saving and loading algorithms.

Original link:

Databricks.com/blog/2018/0…

Introducing Apache Spark 2.3 – The Databricks Blog

