This article is shared by breakDawn from The Huawei Cloud Community: How Spark makes Clusters highly available.

Let’s look at how Spark handles master, worker, and Executor exceptions.

Fault tolerant mechanism – Exeuctor exits

You can start by assuming that the executor in the worker executes a task and sends an unexplained exception or error, and then the corresponding thread disappears. We’ll see what we can do at this point.

The picture above sums it up as follows:

Executor wrapped by the backend process, if abnormal, he will perceive, and call the executorRunner. ExitStatus (), notify the worker.

Take a look at what happens after notifying the worker:

  • The worker notifies the master, which clears exectorInfo and dispatches the worker to recreate it
  • As you can see, the worker’s command to create executors is still for the master to schedule and manage.

The next step is to rebuild the Executor and restart the execution of the local task (so the data will be pulled again, so that the data cached by the sender can be used).The complete flow chart is as follows:

Worker abnormally exits

Assuming that the worker hangs, what will the exeuctor and master do when they are executing the task? As follows:

You can see that the worker has a shutdownHook that helps close the executor being executed.

However, the worker hangs and cannot send a message to the master. What should I do?

As mentioned in the previous section, there is a heartbeat between the master and worker, so there will be the following processing:

It can be seen that when the master finds that the worker’s heartbeat is lost, it will:

  • Delete worker information from the execution list
  • Resend the worker creation operation to the corresponding Spark node
  • Notify the driver that all exectors in the worker are lost

See what the worker rebuild and driver do at this point:

Another important concept can be seen here:

  • The master cares about the worker state
  • The driver cares about executor progress
  • After exeuctor is rebuilt, it needs to be registered with driver

The complete flow chart is as follows:

Master abnormal

Since the master does not participate in the calculation of tasks, but only manages the worker, there are two cases of master exception:

1: The master exits abnormally when the task is running properly

The process is as follows:

As you can see, when the task is running properly, the driver will only trigger the master to clean up at the end of the task, but the master process has already died, so it doesn’t matter.

2: When the master hangs during the task execution, the worker and executor become abnormal

You can see there’s no way to restart Exeuctor at this point. The driver side will look like the task is not progressing.

To avoid this situation, the master can be stateless and then master Dr. Of course, the master node is less likely to crash unless it is considered a kill or deployment node failure.

Click to follow, the first time to learn about Huawei cloud fresh technology ~