Spark RDD conversion operator summary

Small knowledge, big challenge! This article is participating in the creation activity of “Essential Tips for Programmers”.

Difference between Map and mapPartitions

Data processing perspective

A Map operator is a data execution within a partition, similar to a serial operation.

MapPartitions operators perform batch processing on a partition basis.
Function of Angle

Map operator is similar to serial operation, so its performance is low.

MapPartitions operators are similar to batch processing and therefore have high performance.

However, mapPartitions operators occupy memory for a long time, which may lead to insufficient memory and memory overflow errors. Therefore, the map operation is not recommended when the memory is limited.

groupBy

Data is grouped according to specified rules. Partitions remain unchanged by default, but data is shuffled and regrouped.
In the extreme case, data may be grouped in the same partition, with data from one group in one partition, but this does not mean that there is only one group in one partition.

filter

Data is filtered according to the specified rules. Data that meets the rules is retained and data that does not meet the rules is discarded.
After data is filtered, partitions remain the same, but data in partitions may be unbalanced. In the production environment, data skew may occur.

sample

Extract data from a data set according to the specified rules:

Extract data without putting it back (Bernoulli algorithm)

Bernoulli algorithm: also known as 0, 1 distribution. Like flipping a coin, either heads or tails. Specific implementation: according to the seed and random algorithm to calculate a number and the second parameter setting probability comparison, less than the second parameter, greater than the first parameter: whether to put back the extracted data, false: do not put back the second parameter: extraction probability, the range is between [0,1],0: do not take all; 1. Third parameter: random number seedCopy the code

2. Extract data and put it back (Poisson algorithm)

The first parameter: whether the extracted data should be put back, true: put back; False: the second parameter is the probability of duplicate data. The value is greater than or equal to 0. The number of times each element is expected to be extractedCopy the code

coalesce

Partition reduction based on data volume is used to improve the execution efficiency of small data sets after filtering large data sets.
If spark has too many small tasks, you can use the coalesce method to reduce the number of coalesce partitions and reduce the task scheduling cost.

repartition

The coalesce operation is performed internally. The default value of shuffle is true.
The repartition operation can be completed either by converting an RDD with a large number of partitions to an RDD with a small number of partitions or by converting an RDD with a large number of partitions to an RDD with a large number of partitions.

Intersection, union, and subtract

The two RDD data types must be consistent

Zip (zipper)

Two RDD data types can be inconsistent
The two RDD data partitions must be consistent

java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
Copy the code

The two RDD partitions must have the same amount of data

org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
Copy the code

partitionBy

Repartitions the data according to the specified Partitioner. The default Spark partitioner is the HashPartitioner.

If the partitioner repartitioned is the same as the partitioner of the current RDD, the bottom layer checks whether the partitioner’s Equals is the same and the number of partitions is the same. If both are equal, the current partition is returned without any processing.
Spark az:
- Hash Partitioner
- Range Partitioner
Custom partition:

To implement custom partitioning, you need to inherit the Partitioner class and implement the following methods:
- NumPartitions: returns the number of partitions
- GetPartition: returns the ID of the partition where the key is executed
- Equals: standard method for checking equality. Spark uses this method to check whether the partition examples of RDD are the same and determine whether to perform shuffle.

ReduceByKey and groupByKey

From the shuffle Angle:

ReduceByKey and groupByKey both contain the operation of shuffle, but reduceByKey can pre-combine the data of the same key in partitions before shuffle, which will reduce the amount of data dropped in disks. However, groupByKey is only used for grouping, and there is no problem of reducing the amount of data. Therefore, reduceByKey has high performance.
From a functional point of view:

ReduceByKey actually contains grouping and aggregation functions. GroupByKey can only be grouped but cannot be aggregated. Therefore, it is recommended to use reduceByKey in group aggregation. If only grouping without aggregation is required, only GroupByKey can be used.

Differences between reduceByKey, foldByKey, aggregateByKey, and combineByKey

ReduceByKey: No calculation is performed on the first data of the same key, and the calculation rules are the same within and between partitions.
FoldByKey: The first data and the initial value of the same key are calculated within a partition. The calculation rules are the same within and between partitions.
AggregateByKey: The first data and initial value of the same key are calculated within a zone. The calculation rules within and between zones may be different.
CombineByKey: When the data structure does not meet the requirements during calculation, the first data can be converted to the structure. The calculation rules are different between and within partitions.

        reduceByKey:

             combineByKeyWithClassTag[V](
                 (v: V) => v, // The first value does not participate in the calculation
                 func, // Calculate rules within partitions
                 func, // Partition calculation rules
                 )

        aggregateByKey :

            combineByKeyWithClassTag[U](
                (v: V) => cleanedSeqOp(createZero(), v), // The initial value and the value of the first key are the intra-partition data operations
                cleanedSeqOp, // Calculate rules within partitions
                combOp,       // Partition calculation rules
                )

        foldByKey:

            combineByKeyWithClassTag[V](
                (v: V) => cleanedFunc(createZero(), v), // The initial value and the value of the first key are the intra-partition data operations
                cleanedFunc,  // Calculate rules within partitions
                cleanedFunc,  // Partition calculation rules
                )

        combineByKey :

            combineByKeyWithClassTag(
                createCombiner,  // Handle the first data of the same key
                mergeValue,      // Represents the data processing function within the partition
                mergeCombiners,  // Represents the data processing function between partitions
                )

Copy the code

join

Only data with equal keys is processed.