Background of the pitfall

The process is as follows:
1. Obtain a DataFrame using Spark SQL.
2. Map over that DataFrame and call a GET interface to obtain IDs, producing a new DataFrame.
3. Map over the new DataFrame and call a POST interface inside the map, pushing the final results to that interface.

How I fell into it

Out of laziness, I wrote the spark-submit script by copying the submit script of another Spark job. I never expected that script to contain the setting --conf "spark.speculation=true", and I submitted the job without noticing it. Later it was reported that the final POST interface was being called repeatedly: some names triggered the POST call twice, while others triggered it only once.

Solution

In the end I asked a more experienced colleague and learned that when a computation must be executed strictly once, Spark's speculative execution must be turned off. That is, do not set spark.speculation=true in code or in submit scripts. Spark's default value for this setting is false.
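A small pre-submit check could have caught the copied setting. The following is only a sketch in plain Python, not part of Spark: the helper names `spark_conf_from_command` and `speculation_enabled` and the sample command line are made up for illustration, and it only looks at `--conf key=value` pairs on the command line (it does not inspect spark-defaults.conf or settings made in code).

```python
import shlex


def spark_conf_from_command(command):
    """Collect `--conf key=value` pairs from a spark-submit command line."""
    tokens = shlex.split(command)  # handles quoted values like "spark.speculation=true"
    conf = {}
    for i, token in enumerate(tokens):
        if token == "--conf" and i + 1 < len(tokens):
            key, _, value = tokens[i + 1].partition("=")
            conf[key.strip()] = value.strip()
    return conf


def speculation_enabled(command):
    """Return True if the command explicitly turns speculative execution on."""
    # spark.speculation is false unless something sets it, so absent means off.
    value = spark_conf_from_command(command).get("spark.speculation", "false")
    return value.lower() == "true"


# Hypothetical command, similar in shape to the script that was copied.
copied = (
    'spark-submit --master yarn --conf "spark.speculation=true" '
    "--class com.example.Job job.jar"
)
cleaned = "spark-submit --master yarn --class com.example.Job job.jar"

print(speculation_enabled(copied))   # True
print(speculation_enabled(cleaned))  # False
```

Running a check like this over a copied script before submitting would flag the setting, though reading the script line by line does the same job.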

Why

With speculative execution enabled, Spark decides whether to launch a second attempt of a task based on how long the task has been running. If the task processing Partition1 has run for longer than a certain threshold and still has not finished, Spark starts a second attempt for Partition1 on another executor (call it executor2). Whichever attempt finishes first supplies the final result, and the remaining unfinished attempt is killed. In my code, obtaining the POST connection could take a long time, longer than that threshold. For the names whose tasks exceeded it, a second attempt was launched on executor2. The final returned status contains only one record for each name, but the POST interface had actually been invoked twice, and this cannot be seen from the returned status.
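The effect can be reproduced without Spark. The sketch below is a plain-Python simulation, not Spark's actual scheduler: `post_name`, `run_task_with_speculation`, and the 0.05 second threshold are all invented for illustration. It shows a slow first attempt, a duplicate attempt launched after the threshold, a single returned result, and two recorded calls to the side-effecting function.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

post_log = []  # stands in for the request log of the remote POST service
log_lock = threading.Lock()


def post_name(name, attempt, response_ready):
    """Hypothetical stand-in for the POST call made inside map()."""
    with log_lock:
        post_log.append((name, attempt))  # the service receives the request at once
    response_ready.wait()                 # but the response may arrive late
    return f"{name}: ok"


def run_task_with_speculation(name, threshold_s=0.05):
    slow = threading.Event()  # not set yet: the first attempt's response is stuck
    fast = threading.Event()
    fast.set()                # the duplicate attempt gets its response immediately
    with ThreadPoolExecutor(max_workers=2) as pool:
        attempts = [pool.submit(post_name, name, 0, slow)]
        finished, _ = wait(attempts, timeout=threshold_s)
        if not finished:
            # The first attempt is past the threshold, so start a duplicate.
            attempts.append(pool.submit(post_name, name, 1, fast))
        finished, _ = wait(attempts, return_when=FIRST_COMPLETED)
        result = next(iter(finished)).result()
        slow.set()  # Spark would kill the losing attempt; here it is just released
    return result


result = run_task_with_speculation("alice")
print(result)         # alice: ok
print(len(post_log))  # 2
```

Only one result comes back, which matches what the returned status showed, while the service log holds two requests for the same name.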

Lesson

Don't blindly copy and paste. Make sure you understand everything you submit, and think it through carefully.