In a company there is always a clear division of labor: each department is responsible for its own tasks, and everyone works together to keep the company running smoothly. Spark is much like such a company, and it contains many roles

Spark Terminology:

Master: the master node for resource management. It manages the resources of the whole cluster by managing the various workers

Worker: the Master's slave node; together with the Master it manages the cluster's resources

Application: a program written by the user. For example, WordCount, Spark's equivalent of Hello World, is an Application
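To make the WordCount example concrete, here is a plain-Python sketch of the logic that the classic Spark WordCount Application performs (flatMap, then map, then reduceByKey), without needing a cluster. The sample input lines are illustrative:

```python
# Plain-Python analogue of Spark's WordCount: flatMap -> map -> reduceByKey
lines = ["hello spark", "hello world"]

# flatMap: split each line into individual words
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(counts)  # {'hello': 2, 'spark': 1, 'world': 1}
```

In real Spark, each of these steps would be a transformation on an RDD, and the final collection of results would be the action that triggers a job.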

Driver: an Application consists of multiple tasks, which are sent to Executors as units of work. After the Driver receives the Application, it plans the Application's tasks and distributes them to the Executors for execution

Executor: a process started for an Application on a node managed by a Worker process. This process runs the tasks sent by the Driver.

Job: a group of parallel tasks corresponding to one action operator; the tasks it contains are packaged and distributed to Executors for execution

Execution hierarchy:

Task level

Application → Job (one per action) → Stage → Task, where each task runs as a thread

Resource level

Master → Worker → Executor → thread pool → Task (what finally runs)
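The task-level hierarchy above can be sketched as a toy object model. This is purely illustrative; the class names and structure are assumptions for explanation, not Spark's internal classes, and it ignores how stages are actually split at shuffle boundaries:

```python
# Toy model of the task-level hierarchy: Application -> Job -> Stage -> Task
from dataclasses import dataclass
from typing import List

@dataclass
class Task:          # smallest unit of work; runs as a thread in an Executor
    stage_id: int

@dataclass
class Stage:         # a group of tasks that can run in parallel
    tasks: List[Task]

@dataclass
class Job:           # one Job is produced per action operator
    stages: List[Stage]

@dataclass
class Application:   # the user's program as a whole
    jobs: List[Job]

# An application whose code calls two actions yields two jobs
app = Application(jobs=[
    Job(stages=[Stage(tasks=[Task(0), Task(0)])]),
    Job(stages=[Stage(tasks=[Task(1)])]),
])
print(len(app.jobs))  # 2 jobs, one per action
```

The key relationship it captures is the one stated above: the number of jobs equals the number of actions, and each job breaks down into stages and finally tasks.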



In a Spark cluster, each Worker manages the resources on its own node, while the Master manages the resources of the whole cluster by managing the Workers

The application is packaged into a jar and uploaded to the client; the client then submits the application to the Spark cluster.

A Driver process is started on the client. The Driver splits the application into jobs based on the code, and then splits each job into tasks. The Driver then requests resources from the Master, and an Executor with a thread pool is started on each node with sufficient resources


The client. Given the client in the basic flow above, why submit through a client instead of submitting tasks directly to the cluster?

1. It minimizes performance differences among servers in the cluster, preventing the barrel effect

If applications were submitted directly to the cluster, a Driver would be started on a worker to pull and distribute tasks one by one. With many workers this causes serious disk I/O. In addition, because the same node handles distribution for a long time, the frequent communication opens a performance gap between that node and the others. That is where the barrel effect comes in

2. It makes the cluster easier to maintain and prevents behavior that could damage the cluster

With a client in place, different users can submit programs with different permissions, and those permissions can be configured on the client. Moreover, although the client node may itself suffer the barrel effect during use, it sits outside the cluster, so the cluster is unaffected

With the client set up to absorb these unwanted effects, two different application submission modes are available to make better use of the cluster's performance

Client mode


1. The worker will report the resource situation to the master.

2. The Master therefore knows the resource status of the cluster: how many cores and how much memory Spark has in total, and how many cores and how much memory each worker manages

3. Spark-Submit is used on the client to submit an application, and a Driver process is started on the client

4. The driver process will apply for resources from the master.

5. The Master sends requests to start Executors on worker nodes that have sufficient resources

6. Once the Executors start up, they reverse-register with the Driver, so the Driver knows how many Executors it has available for computation

7. The Driver sends tasks and collects the results. It monitors the Executors' computation, retries tasks if errors occur, and logs each task's run

Cluster mode

Client submission mode carries a certain risk: if a large number of tasks run, all of their results are pulled back to the Driver. The Driver may then exit unexpectedly, tasks can no longer be scheduled, and no further programs can be submitted to the cluster

The flow is basically the same as client mode, except that the Driver starts on one of the worker nodes with sufficient resources. As before, the Driver requests resources, Executors are started on worker nodes, and they reverse-register with the Driver. The difference in cluster submission mode is that the results are not pulled back to the client for viewing; in other words, the cluster's running status cannot be monitored from the client and can only be viewed on the Web UI
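The two modes are selected at submission time. Below is a hedged command-line sketch; the master host/port, main class, and jar name are placeholders, while the `--master`, `--deploy-mode`, and `--class` flags are standard spark-submit options:

```shell
# Client mode: the Driver starts on the client machine
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --class com.example.WordCount \
  wordcount.jar

# Cluster mode: the Driver starts on a worker node with sufficient resources
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --class com.example.WordCount \
  wordcount.jar
```

Everything else in the submission command stays the same; only `--deploy-mode` decides where the Driver runs.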

In summary, the two submission modes differ as follows:

- Driver location: client mode starts the Driver on the client; cluster mode starts it on a worker node with sufficient resources
- Results: client mode pulls the results back to the client; cluster mode does not
- Monitoring: client mode allows monitoring from the client; cluster mode can only be viewed on the Web UI
