Apache Spark 3.0 builds on Spark 2.x and brings new ideas and features.

Spark is an open source unified engine for big data processing, data science, machine learning, and data analytics workloads. Since its first release in 2010 it has grown into one of the most active open source projects, and it provides APIs for Java, Scala, Python, R, and other languages.

Spark SQL is the most active component in this release: about 46% of the resolved issues are for Spark SQL, which also underpins the higher-level libraries, including Structured Streaming and MLlib, and the high-level APIs, including SQL and DataFrames. After extensive optimization, Spark 3.0 is roughly two times faster than Spark 2.4.

Python is by far the most widely used language on Spark; PySpark has more than 5 million monthly downloads on PyPI. Spark 3.0 brings a number of improvements to PySpark's functionality and usability, including a redesign of the pandas UDF API with Python type hints, new pandas UDF types, and more Pythonic error handling.

Here are some of the highlights of Spark 3.0: adaptive query execution, dynamic partition pruning, ANSI SQL compliance, major improvements to the pandas APIs, a new UI for Structured Streaming, up to a 40-fold speedup when calling R user-defined functions, accelerator-aware scheduling, and SQL reference documentation.

These features can be grouped into the following modules:

  • Core, Spark SQL, and Structured Streaming
  • MLlib
  • SparkR
  • GraphX
  • Dropped support for Python 2 and for R versions below 3.4;
  • Fixes for known issues;

Core, Spark SQL, and Structured Streaming

Prominent features

  1. Accelerator-aware scheduler;
  2. Adaptive query execution;
  3. Dynamic partition pruning (see the configuration sketch after this list);
  4. Redesigned pandas UDF API with type hints;
  5. Structured Streaming UI;
  6. Catalog plugin API support;
  7. Java 11 support;
  8. Hadoop 3 support;
  9. Better ANSI SQL compatibility;
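
Both adaptive query execution and dynamic partition pruning are driven by SQL configuration flags. A minimal PySpark sketch for enabling them explicitly (the application name is illustrative, and defaults vary between releases):

```python
from pyspark.sql import SparkSession

# Enable adaptive query execution and dynamic partition pruning explicitly.
spark = (
    SparkSession.builder
    .appName("spark3-feature-demo")  # hypothetical app name
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    .getOrCreate()
)
```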

Performance improvements

  1. Adaptive query execution;
  2. Dynamic partition pruning;
  3. Optimizations to nine rules;
  4. Minimized table cache synchronization costs;
  5. Aggregation code split into small functions;
  6. Batching added to the INSERT and ALTER TABLE ADD PARTITION commands;
  7. Aggregators can now be registered as UDAFs;

SQL compatibility enhancements

  1. Use of the Proleptic Gregorian calendar;
  2. Spark's own datetime pattern definitions;
  3. An ANSI store assignment policy introduced for table inserts;
  4. ANSI store assignment rules followed by default in table inserts;
  5. A new SQLConf, spark.sql.ansi.enabled, for turning on ANSI mode (see the sketch after this list);
  6. ANSI SQL FILTER clause support for aggregate expressions;
  7. ANSI SQL OVERLAY function support;
  8. ANSI nested bracketed comments support;
  9. Exceptions thrown on values outside the integer range;
  10. Overflow checks for interval arithmetic operations;
  11. Exceptions thrown when an invalid string is cast to a numeric type;
  12. Overflow behavior of interval multiplication and division made consistent with other operations;
  13. ANSI type aliases added for char and decimal;
  14. ANSI-compliant reserved keywords defined in the SQL parser;
  15. Reserved keywords disallowed as identifiers when ANSI mode is enabled;
  16. ANSI SQL LIKE ... ESCAPE syntax support;
  17. ANSI SQL Boolean predicate syntax support;
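
As a quick illustration of ANSI mode, here is a small PySpark sketch showing how spark.sql.ansi.enabled changes the behaviour of an invalid string-to-integer cast (the output comments are indicative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default behaviour (ANSI mode off): an invalid cast silently returns NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT) AS v").show()  # v is null

# With ANSI mode on, the same cast raises a runtime error instead.
spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    spark.sql("SELECT CAST('abc' AS INT) AS v").show()
except Exception as e:
    print("ANSI mode rejected the invalid cast:", type(e).__name__)
```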

PySpark enhancements

  1. Redesigned pandas UDFs with type hints (see the sketch after this list);
  2. Pandas UDFs can take an iterator of pd.DataFrames;
  3. StructType supported as the argument and return type of a scalar pandas UDF;
  4. DataFrame cogroup supported via pandas UDFs;
  5. mapInPandas added, allowing an iterator of DataFrames;
  6. Some SQL functions now also take column names;
  7. More Pythonic SQL exceptions in PySpark;
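
The sketch below illustrates two of the items above: a type-hinted scalar pandas UDF and mapInPandas over an iterator of pandas DataFrames (the column and function names are illustrative):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)  # single column "id"

# New-style pandas UDF: the Python type hints tell Spark this is a
# Series-to-Series (scalar) UDF, so no functionType argument is needed.
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

df.select(plus_one("id").alias("id_plus_one")).show()

# mapInPandas: the function receives an iterator of pandas DataFrames and
# yields pandas DataFrames, so it can process the input batch by batch.
def keep_even(batches):
    for pdf in batches:
        yield pdf[pdf["id"] % 2 == 0]

df.mapInPandas(keep_even, schema="id long").show()
```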

Extensibility enhancements

  1. Catalog plugin;
  2. Data source V2 API refactoring;
  3. Hive 3.0 and 3.1 metastore support;
  4. The Spark plugin interface extended to the driver;
  5. The Spark metrics system extensible with user-defined metrics;
  6. Developer APIs for extended columnar processing support;
  7. Built-in sources migrated to DSV2: parquet, ORC, CSV, JSON, Kafka, Text, Avro;
  8. Function injection allowed in SparkExtensions;

Connector enhancements

  1. spark.sql.statistics.fallBackToHdfs supported in data source tables;
  2. Apache ORC upgraded to 1.5.9;
  3. CSV data source filters supported;
  4. Native data sources used to optimize inserts into partitioned Hive tables;
  5. Kafka upgraded to 2.4.1;
  6. A new built-in binary file data source, plus a new no-op batch data source and no-op streaming sink (see the sketch after this list);
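
A short sketch of the no-op batch sink mentioned above: it materialises every row but writes nothing, which makes it convenient for benchmarking a query without paying any output cost (the row count is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).selectExpr("id", "id % 10 AS bucket")

# The "noop" format forces full execution of the plan but discards the output.
df.write.format("noop").mode("overwrite").save()
```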

Native Spark applications on Kubernetes

  1. More responsive dynamic allocation on Kubernetes, and Kerberos support added for Spark on Kubernetes (see the sketch after this list);
  2. Client dependencies supported via Hadoop-compatible file systems;
  3. A configurable authentication secret source added in the Kubernetes backend;
  4. Kubernetes subpath mounts supported;
  5. Python 3 made the default in the PySpark bindings for Kubernetes;
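
A rough sketch of a driver-side session configured for dynamic allocation on Kubernetes; the master URL and container image below are placeholders, not a tested deployment:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://example-apiserver:6443")                     # hypothetical API server
    .config("spark.kubernetes.container.image", "my-repo/spark:3.0.0")  # hypothetical image
    .config("spark.dynamicAllocation.enabled", "true")
    # Shuffle tracking lets dynamic allocation work without an external
    # shuffle service, which Kubernetes does not provide.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```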

MLlib

  1. Multi-column support added for Binarizer, StringIndexer, StopWordsRemover, and PySpark QuantileDiscretizer;
  2. Tree-based feature transformation supported;
  3. Two new evaluators added, MultilabelClassificationEvaluator and RankingEvaluator;
  4. An R API added for PowerIterationClustering;
  5. A Spark ML listener added for tracking the status of ML pipelines;
  6. Fit with a validation set added for gradient boosted trees in Python;
  7. The RobustScaler transformer added (see the sketch after this list);
  8. Factorization machines classifier and regressor added;
  9. Gaussian Naive Bayes and Complement Naive Bayes added;
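
For example, the new RobustScaler works like any other pyspark.ml feature transformer; it scales each feature using the median and interquartile range, so it is much less sensitive to outliers than standard scaling (the toy data below is illustrative):

```python
from pyspark.ml.feature import RobustScaler
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A tiny toy dataset; the 300.0 is a deliberate outlier.
data = spark.createDataFrame(
    [(Vectors.dense([1.0, 10.0]),),
     (Vectors.dense([2.0, 20.0]),),
     (Vectors.dense([3.0, 300.0]),)],
    ["features"],
)

scaler = RobustScaler(inputCol="features", outputCol="scaled")
scaler.fit(data).transform(data).show(truncate=False)
```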

In addition, in Spark 3.0, multiclass logistic regression in PySpark now returns LogisticRegressionSummary rather than its subclass BinaryLogisticRegressionSummary, and the pyspark.ml.param.shared.Has* mixins no longer provide any set*(self, value) setter methods; use the respective self.set(self.*, value) instead.

SparkR

SparkR interoperability is optimized by vectorizing R gapply(), dapply(), createDataFrame, and collect(), which significantly improves performance.

There is also eager execution for the R shell and IDE, and an R API for Power Iteration Clustering.

Deprecated components

  1. Python 2 support deprecated;
  2. Support for R versions below 3.4 deprecated;
  3. UserDefinedAggregateFunction deprecated;

Overall, Spark 3.0 is a major release, bringing a number of new features, fixes for known issues, and significant performance improvements.

Since Python officially announced the end of maintenance for Python 2, various components have responded by dropping Python 2 support. If you are learning Python for a project, consider going straight to Python 3.

I may not always be serious, but I do have plenty to share! Follow me for more programming and technology content.