Background

Recently, Lao Liu has been browsing big data interview-experience posts on Niuke.com in the evenings, and he keeps running into one high-frequency interview question: what problems have you encountered in your learning process?

This is a difficult question to answer. If my answer is too simple, will the interviewer think my level is too low? But what should I say? As a self-taught learner, how am I supposed to have run into any advanced problems?

The answers to this question online are also controversial, so Lao Liu will talk about it too and share his own view. You're welcome to come and debate!


Process

While searching for an answer to this question, Lao Liu happened to be learning Spark Streaming, the real-time computing module of the Spark framework, and it has a very classic problem around its speculation mechanism!

What is the speculation mechanism?

Suppose many tasks are running and most of them finish quickly, but one task runs very slowly. In a real-time computing job with strict latency requirements, even a delay of two or three seconds matters.

So Spark Streaming provides a speculation mechanism specifically for these slow-running tasks.

Suppose a total of 10 tasks have been submitted. If the number of successfully completed tasks is greater than 0.75 × 10, and a still-running task has been running for longer than 1.5 × the median running time of the successful tasks, that task is judged to be slow and the speculation mechanism re-launches it on another node.
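To make this concrete, here is a minimal sketch of how speculation is usually switched on. The configuration keys and the default values shown (0.75 quantile, 1.5 multiplier, 100 ms check interval) are Spark's own; the application name, master and batch interval are just placeholders for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Speculative execution is controlled by Spark-core settings; the values below
// are Spark's documented defaults, spelled out explicitly for clarity.
val conf = new SparkConf()
  .setAppName("speculation-demo")              // placeholder app name
  .setMaster("local[4]")                       // placeholder master, just to make the sketch runnable
  .set("spark.speculation", "true")            // speculation is off by default
  .set("spark.speculation.interval", "100ms")  // how often to check for slow tasks
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must have finished first
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as "slow"

// A 5-second batch interval, purely for illustration.
val ssc = new StreamingContext(conf, Seconds(5))
```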

However, there is a serious problem here. Lao Liu spotted it on his own when he first studied this, and later saw it mentioned in some training-institution videos as well, which shows that his awareness is slowly improving through self-study.


The question is: what happens if the slow-running task is slow because of data skew?

Suppose there are 5 tasks and one of them hits data skew. Even with the skew (as long as it is not too severe), that task would still finish eventually, just slowly: say it needs 6 s while the other 4 tasks take only 1 s each. Now turn speculation on. The median time of the successful tasks is 1 s, so the slow-task threshold is 1.5 × 1 s = 1.5 s. Once the skewed task has run for about 2 s it is already over the threshold, so it gets rescheduled and has to start over. The next attempt runs to 3 s, trips the threshold again, and is rescheduled again, and so on; the task keeps getting kicked before it can ever finish.

The teacher in a training-institution video also brought up this point. Lao Liu had noticed this shortcoming of the speculation mechanism himself, so he is sharing it with everyone!

Solution

So what should you do when speculation is turned on and you run into data skew?

We can fall back on the usual solutions to data skew. Lao Liu will roughly go over several of them:

1. If only a few keys are causing the data skew, and dropping them does not significantly affect the result, simply filter out those few skewed keys.

2. Two-stage aggregation: attach a random prefix to each key so that one hot key becomes several different keys. The data that used to be processed by a single task is then spread across multiple tasks for local aggregation, which solves the problem of a single task handling too much data. After that, strip the random prefix and aggregate again globally to get the final result (see the first sketch after this list). Note that this method only works for aggregation-type shuffles, not for join-type shuffles.

3. For data skew caused by a join, if only a few keys cause the skew, split those keys out into a separate RDD, salt them with random prefixes from 0 to N−1, expand the matching part of the other RDD N times, and join them (see the second sketch after this list). The data for these keys is then no longer concentrated in a few tasks but spread across many tasks. This method suits joining two tables that are both large.

4. If, during a join, one of the RDDs has a large number of skewed keys, splitting out individual keys is pointless, and full salting is basically the only remaining option: attach a random prefix to every key on the skewed side and expand the other side to match, so that instead of one task handling a huge number of identical keys, the now-different keys are spread across multiple tasks (the second sketch after this list covers this case too).
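For the two-stage aggregation in point 2, here is a minimal sketch in Scala. The toy data set, the local[*] master and the salt width of 10 are made up purely for illustration; only the salting idea itself is the technique being shown.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

val spark = SparkSession.builder().appName("two-stage-agg-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// A toy skewed dataset: the key "hot" is vastly more frequent than the others.
val wordCounts = sc.parallelize(
  Seq.fill(100000)(("hot", 1L)) ++ Seq(("cold", 1L), ("warm", 1L))
)

// Stage 1: salt each key with a random prefix so the hot key is split across ~10 buckets,
// then aggregate locally on the salted keys.
val locallyAggregated = wordCounts
  .map { case (key, count) => (s"${Random.nextInt(10)}_$key", count) }
  .reduceByKey(_ + _)

// Stage 2: strip the salt and aggregate again to get the true per-key totals.
val result = locallyAggregated
  .map { case (saltedKey, count) => (saltedKey.split("_", 2)(1), count) }
  .reduceByKey(_ + _)

result.collect().foreach(println)   // e.g. (hot,100000), (cold,1), (warm,1)
```

How many salt buckets to use simply depends on how many tasks you want the hot key spread across.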
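For the join-side skew in points 3 and 4, here is a minimal sketch along the same lines; the toy RDDs and N = 10 are again made-up illustration values. Point 3 applies this only to the few skewed keys that have been split into their own RDD (and unions the result with a normal join of the remaining keys), while point 4 salts the whole RDD and pays for it by expanding the other side N times.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

val spark = SparkSession.builder().appName("salted-join-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Toy data: the key "hot" dominates the left side; the right side maps keys to labels.
val skewed = sc.parallelize(Seq.fill(100000)(("hot", 1)) ++ Seq(("cold", 2)))
val other  = sc.parallelize(Seq(("hot", "h"), ("cold", "c")))

val N = 10  // number of salt buckets, chosen arbitrarily for illustration

// Prefix every key on the skewed side with a random number in [0, N).
val saltedSkewed = skewed.map { case (k, v) => (s"${Random.nextInt(N)}_$k", v) }

// Expand the other side N times, once per possible prefix, so every salted key
// still finds its matching rows.
val expandedOther = other.flatMap { case (k, v) =>
  (0 until N).map(i => (s"${i}_$k", v))
}

// Join on the salted keys, then drop the prefix to recover the original keys.
val joined = saltedSkewed
  .join(expandedOther)
  .map { case (saltedKey, pair) => (saltedKey.split("_", 2)(1), pair) }
```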


Well, that wraps up the Spark Streaming speculation mechanism, and now you have something to tell the interviewer. If you have any questions, you can reach Lao Liu through his official account, Hardworking Lao Liu. You're welcome to come and debate with him!