
Operator tuning 4: Using filter and coalesce together

In Spark jobs, the filter operator is often used to filter RDD data. At the start of a job, the amount of data loaded into each partition is roughly the same, but after filter is applied the data volume of each partition can differ greatly, as shown in the following figure:

  • From the figure, we can identify two problems:
    • The amount of data in each partition becomes smaller. If we keep processing it with one task per partition, as before the filter, each task's computing resources are wasted.

    • The amount of data in each partition is no longer the same, so each subsequent task has to process a different data volume for its partition, which can cause data skew.

As shown above, 100 records remain in the second partition after filtering, while 800 remain in the third. Under the same processing logic, the task for the third partition handles eight times as much data as the task for the second, so their running times may also differ by several times. This is the data skew problem.

  • We analyze these two problems separately:

    • For the first problem, since each partition now holds less data, we want to redistribute the partition data; for example, the data of the original 4 partitions can be consolidated into 2 partitions, so that only two tasks are needed for processing and resources are not wasted.

    • For the second problem, the solution is very similar: redistribute the partition data so that each partition holds roughly the same amount of data, which avoids the data skew problem.

So how do we implement these solutions? We need the coalesce operator.

Both repartition and coalesce can be used to change the number of partitions. repartition is simply coalesce with shuffle set to true; by default coalesce does not shuffle, but shuffle can be enabled through its second parameter.
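A minimal sketch of this relationship, assuming a simple local application (the object name, master setting, and sample data are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceVsRepartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("coalesce-vs-repartition").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 1000, numSlices = 8) // 8 initial partitions

    // coalesce without shuffle: narrow dependency, existing partitions are merged
    val merged = rdd.coalesce(4)

    // coalesce with shuffle: data is redistributed evenly across the new partitions
    val rebalanced = rdd.coalesce(4, shuffle = true)

    // repartition(n) is equivalent to coalesce(n, shuffle = true)
    val same = rdd.repartition(4)

    println(Seq(merged, rebalanced, same).map(_.getNumPartitions)) // List(4, 4, 4)
    sc.stop()
  }
}
```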

  • Suppose we want to change the original number of partitions A to B by repartitioning.
    • A > B (many partitions merged into fewer partitions)

      1. If A and B are close, use coalesce without enabling shuffle.
      2. If A is much larger than B, coalesce without shuffle still works, but its performance is poor because parallelism drops sharply; in this case it is recommended to set the second parameter of coalesce to true to enable the shuffle process.
    • A < B (few partitions split into more partitions)

      1. In this case, use repartition. If you use coalesce instead, you must set shuffle to true; otherwise coalesce has no effect.

After the filter operation, you can use the coalesce operator to compress the number of partitions and make the amount of data in each partition as uniform and compact as possible, which benefits subsequent tasks and can improve performance to some extent. A sketch of this pattern follows.
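A minimal sketch of the filter-then-coalesce pattern, assuming sc is an existing SparkContext (the HDFS paths and partition counts are illustrative):

```scala
// The input is read with many partitions; a selective filter then leaves
// each partition small and uneven.
val logs   = sc.textFile("hdfs:///input/app-logs", minPartitions = 1000) // hypothetical path
val errors = logs.filter(_.contains("ERROR"))

// The drop in partition count is large (1000 -> 100), so enable shuffle to
// spread the surviving data evenly instead of merging into a few heavy partitions.
val compacted = errors.coalesce(100, shuffle = true)

compacted.saveAsTextFile("hdfs:///output/error-logs") // hypothetical path
```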

Note: Local mode simulates a cluster inside a single process and already performs some internal optimization of parallelism and partitioning, so there is no need to set them explicitly.

Operator tuning 5: reduceByKey pre-aggregation

Compared with an ordinary shuffle operation, the distinctive feature of reduceByKey is that it performs local aggregation on the map side. The map side first combines the local data and then writes it to the files created for each task of the next stage; in other words, on the map side, the reduceByKey function is already applied to the values of each key. The execution process of the reduceByKey operator is shown in the figure:

  • The performance improvements brought by reduceByKey are as follows:
    • After local aggregation, the amount of data on the map side is reduced, which reduces disk I/O and disk space usage.
    • After local aggregation, the next stage pulls less data, reducing the amount of data transmitted over the network.
    • After local aggregation, less memory is needed to cache data on the reduce side.
    • After local aggregation, the amount of data to be aggregated on the reduce side is reduced.
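A minimal word-count style sketch of this pre-aggregation, assuming sc is an existing SparkContext (the input path is illustrative):

```scala
val words = sc.textFile("hdfs:///input/words") // hypothetical path

val counts = words
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  // The same (_ + _) function is applied locally on the map side before the
  // shuffle, and again on the reduce side to merge the partial sums.
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
```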

Given the local-aggregation feature of reduceByKey, we should consider using it in place of other shuffle operators such as groupByKey. The operating principles of groupByKey and reduceByKey are shown in the figures below:

GroupByKey principle

ReduceByKey principle

As the figures show, groupByKey performs no aggregation on the map side: all map-side data is shuffled to the reduce side and aggregated there. reduceByKey, with its map-side aggregation, reduces the amount of data transmitted over the network, so its efficiency is significantly higher than that of groupByKey.
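A minimal sketch of the replacement, assuming sc is an existing SparkContext (the sample data is illustrative); both variants produce the same result, but the reduceByKey version shuffles far less data:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1)), numSlices = 2)

// groupByKey: every (key, value) record is shuffled, then summed on the reduce side
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: values are pre-summed within each partition, so less data is shuffled
val viaReduce = pairs.reduceByKey(_ + _)

viaGroup.collect().sorted.foreach(println)  // (a,2) and (b,2)
viaReduce.collect().sorted.foreach(println) // (a,2) and (b,2)
```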