
Solution 6: Two-stage aggregation using random key prefixes

  • When operators such as groupByKey and reduceByKey are used, consider two-stage aggregation with random key prefixes, as shown in the figure:

  • First, use the map operator to attach a random-number prefix to the key of each record, scattering the keys: records that originally shared one key now have different keys. Perform a first, local aggregation on these keys, so that data that would have been processed by a single task is spread across multiple tasks. Finally, strip the random prefix from each key and aggregate again to obtain the global result.

  • This method works well for data skew caused by aggregation operators such as groupByKey and reduceByKey, but it applies only to aggregation-style shuffle operations, so its scope is relatively narrow. For join-style shuffle operations, another solution is needed.

  • It is also the solution to try when the previous schemes have not produced good results.
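The two-stage pattern above can be sketched in plain Python, with a list of (key, value) pairs standing in for an RDD and dictionaries standing in for the shuffles. The function name, prefix count, and data are illustrative, not from the original article; in real Spark this would be a map/reduceByKey/map/reduceByKey chain.

```python
import random
from collections import defaultdict

def two_stage_sum(pairs, n_prefixes=10):
    """Two-stage ("salted") aggregation over (key, value) pairs.

    Plain-Python stand-in for the Spark pattern:
    rdd.map(salt).reduceByKey(add).map(unsalt).reduceByKey(add)
    """
    # Stage 1: attach a random numeric prefix to each key, then aggregate
    # locally. Records that shared one hot key are now spread across up to
    # n_prefixes distinct "salted" keys, so no single task sees them all.
    local = defaultdict(int)
    for key, value in pairs:
        salted = (random.randrange(n_prefixes), key)
        local[salted] += value

    # Stage 2: strip the prefix and aggregate the partial sums globally.
    final = defaultdict(int)
    for (_prefix, key), partial in local.items():
        final[key] += partial
    return dict(final)

data = [("a", 1)] * 1000 + [("b", 1)] * 3  # "a" is the skewed key
print(two_stage_sum(data))  # {'a': 1000, 'b': 3}
```

The result is identical to a direct aggregation; only the distribution of work changes, which is why the trick applies to aggregations but not directly to joins.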

Solution 7: Join using random prefixes and RDD expansion

  • If a large number of keys in an RDD cause data skew during a join, splitting off individual keys is pointless, and this last solution must be used. For the join, we can expand the data in one RDD and "dilute" the other RDD with random prefixes before joining.

  • By attaching random prefixes, we turn the originally identical keys into different keys, so these "different keys" can be spread across multiple tasks instead of one task processing a huge number of identical keys. Because a large number of keys are skewed here, it is not possible to split off just a few of them for separate handling; instead the entire RDD must be expanded, which demands substantial memory resources.

  1. Core idea:
    • Pick one RDD and use flatMap to expand it: map each record to N records, attaching the numeric prefixes 1 to N to its key. (expansion)

    • Map the other RDD: prefix each record's key with a single random number from 1 to N. (dilution)

    • Join the two processed RDDs.
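The three steps above can be sketched in plain Python, with lists of (key, value) pairs standing in for RDDs and a hash join standing in for Spark's shuffle join. The function name, N, and the sample data are illustrative assumptions, not from the article.

```python
import random

def salted_join(left, right, n=3):
    """Join via expansion/dilution over (key, value) lists.

    `left` is expanded: every record becomes n records, with key prefixes
    1..n (the flatMap step). `right` is diluted: every record's key gets
    one random prefix in 1..n (the map step). Joining on the salted keys
    then spreads each hot key over n buckets instead of one.
    """
    expanded = [((i, k), v) for k, v in left for i in range(1, n + 1)]
    diluted = [((random.randint(1, n), k), v) for k, v in right]

    # Hash join on the salted keys, then strip the prefix from the output.
    index = {}
    for sk, v in expanded:
        index.setdefault(sk, []).append(v)
    return [(k, (lv, rv))
            for (i, k), rv in diluted
            for lv in index.get((i, k), [])]

small = [("a", "dim")]                 # RDD chosen for expansion
big = [("a", 1), ("a", 2), ("b", 3)]   # RDD with skew on key "a"
print(sorted(salted_join(small, big)))
# [('a', ('dim', 1)), ('a', ('dim', 2))]
```

Every diluted record carries exactly one random prefix, and the expanded side carries all n prefixes, so each matching pair meets exactly once and the join result is the same as an unsalted join.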

  2. Limitations:

If both RDDs are large, expanding one of them by a factor of N is infeasible. Expansion can only alleviate data skew; it cannot eliminate it completely.

Further optimization, combining Scheme 7 with Scheme 6:

  • When only a few keys in the RDD cause data skew, Scheme 6 no longer applies, and Scheme 7 consumes a lot of resources. In this case, the idea of Scheme 7 can be borrowed to improve Scheme 6:

    • For the RDD containing the few keys with very large amounts of data, draw a sample with the sample operator, count each key's occurrences, and determine which keys hold the most data.

    • Then split the data for those keys out of the original RDD to form a separate RDD, and prefix each of its keys with a random number from 1 to N. The remaining keys, which do not cause skew, form another RDD.

    • Next, from the other RDD to be joined, filter out the data for those same skewed keys to form a separate RDD, and expand each of its records into N records, prefixed in order with 1 to N.

    • Then join the separate random-prefixed RDD against the separate RDD that was expanded N times. The original single key is now split into N parts and spread across multiple tasks for the join.

    • Join the two remaining ordinary RDDs as usual.

    • Finally, combine the results of the two joins with the union operator to obtain the final join result.
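The whole improved scheme can be sketched end to end in plain Python under the same stand-ins as before: lists of (key, value) pairs for RDDs, a full count in place of Spark's sample()-based estimate, hash joins in place of shuffle joins, and list concatenation for union. All names, N, and the data are illustrative.

```python
import random
from collections import Counter

def skew_aware_join(left, right, n=3, top=1):
    """Isolate the `top` hottest keys of `left`, join them with the
    salted/expanded trick, join everything else normally, then union."""
    # Step 1: find the hottest keys (stand-in for a sample()-based count).
    hot = {k for k, _ in Counter(k for k, _ in left).most_common(top)}

    # Step 2: split both sides into a skewed part and an ordinary part.
    left_hot = [(k, v) for k, v in left if k in hot]
    left_rest = [(k, v) for k, v in left if k not in hot]
    right_hot = [(k, v) for k, v in right if k in hot]
    right_rest = [(k, v) for k, v in right if k not in hot]

    # Step 3: random-prefix the skewed side, expand the other side 1..n,
    # and join the two small skewed RDD stand-ins on the salted keys.
    salted = [((random.randint(1, n), k), v) for k, v in left_hot]
    expanded = {}
    for k, v in right_hot:
        for i in range(1, n + 1):
            expanded.setdefault((i, k), []).append(v)
    joined_hot = [(k, (lv, rv)) for (i, k), lv in salted
                  for rv in expanded.get((i, k), [])]

    # Step 4: plain hash join for the ordinary, non-skewed data.
    index = {}
    for k, v in right_rest:
        index.setdefault(k, []).append(v)
    joined_rest = [(k, (lv, rv)) for k, lv in left_rest
                   for rv in index.get(k, [])]

    # Step 5: union of the two join results.
    return joined_hot + joined_rest

left = [("a", i) for i in range(100)] + [("b", 0)]  # "a" is skewed
right = [("a", "x"), ("b", "y")]
result = skew_aware_join(left, right)
print(len(result))  # 101: every left record matched exactly once
```

Only the skewed keys pay the N-fold expansion cost here, which is why this hybrid is cheaper than Scheme 7 applied to the whole RDD.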