
General Performance Tuning 1: Optimizing Resource Allocation

  • The first step in Spark performance tuning is to allocate more resources to the task. Within a certain range, performance improves roughly in proportion to the resources allocated. Once the optimal resource allocation is reached, move on to the tuning strategies described in the following sections.

  • Resource allocation is specified in the script used to submit the Spark task. A standard submit script looks like this:

bin/spark-submit \
--class com.atguigu.spark.Analysis \
--master yarn \
--deploy-mode cluster \
--num-executors 80 \
--driver-memory 6g \
--executor-memory 6g \
--executor-cores 3 \
/usr/opt/modules/spark/jar/spark.jar

The resources that can be allocated are shown in the table:

Parameter          Description
--num-executors    The number of Executors
--driver-memory    The Driver's memory size (has little impact on performance)
--executor-memory  The memory size of each Executor
--executor-cores   The number of CPU cores of each Executor
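
These flags map directly onto Spark configuration properties, so the same allocation can also be expressed with --conf. A minimal sketch, reusing the example class and jar path from above (the property names come from the standard Spark configuration):

bin/spark-submit \
--class com.atguigu.spark.Analysis \
--master yarn \
--deploy-mode cluster \
--conf spark.executor.instances=80 \
--conf spark.driver.memory=6g \
--conf spark.executor.memory=6g \
--conf spark.executor.cores=3 \
/usr/opt/modules/spark/jar/spark.jar

Note that spark.executor.instances applies when running on Yarn.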

Tuning principle: try to adjust the resources allocated to the task to the maximum amount available.

  • For specific resource allocation, we discuss Spark's two cluster operation modes separately:
    • The first is Spark Standalone mode. Before submitting a task, find out from the operations team how many resources are available to you, and then allocate resources in the submit script accordingly. For example, if 15 machines are available, each with 8 GB of memory and 2 CPU cores, you would specify 15 executors, each allocated 8 GB of memory and 2 CPU cores.
    • The second is Spark Yarn mode. Because Yarn uses resource queues to allocate and schedule resources, the submit script should request resources according to the queue the Spark job is submitted to. For example, if the queue has 400 GB of memory and 100 CPU cores, you could specify 50 executors, each allocated 8 GB of memory and 2 CPU cores (a matching submit script is sketched below).
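
As a sketch, the Yarn example above would translate into the following submit script, reusing the example class and jar path from earlier:

bin/spark-submit \
--class com.atguigu.spark.Analysis \
--master yarn \
--deploy-mode cluster \
--num-executors 50 \
--executor-memory 8g \
--executor-cores 2 \
/usr/opt/modules/spark/jar/spark.jar

With 50 executors at 8 GB and 2 cores each, the job consumes exactly the queue's 400 GB of memory and 100 CPU cores.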

After adjusting the resources, the performance will be improved as follows:

  • Increase the number of Executors: if resources allow, increasing the number of executors raises the number of tasks that can run in parallel. For example, 4 executors with 2 CPU cores each can run 8 tasks in parallel; increasing to 8 executors (resources permitting) allows 16 parallel tasks, doubling the parallelism.
  • Increase the number of CPU cores per Executor: if resources permit, increasing the CPU cores per Executor likewise raises parallelism. For example, 4 executors with 2 CPU cores each can run 8 tasks in parallel; raising each Executor to 4 cores (resources permitting) allows 16 parallel tasks, doubling the parallelism.
  • Increase the amount of memory per Executor: resources permitting, more memory per Executor brings three performance benefits (a configuration sketch follows the list):

1. More data can be cached (that is, more of the RDD can be held in memory), so less data is written to disk, possibly none at all.

2. More memory is available for shuffle operations, that is, more space to hold the data pulled by the reduce side.

3. When tasks are executed, many objects may be created. With a small amount of memory, GC occurs frequently; increasing the memory avoids frequent GC and improves overall performance.
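
Inside each Executor, the split between cache space (benefit 1) and execution/shuffle space (benefit 2) is governed by Spark's unified memory manager (Spark 1.6+). A minimal sketch of tuning it at submit time, assuming the same example job as before; the values shown are the documented defaults:

# spark.memory.fraction: share of the heap used for execution and storage (default 0.6)
# spark.memory.storageFraction: share of that space reserved for cached data (default 0.5)
bin/spark-submit \
--class com.atguigu.spark.Analysis \
--master yarn \
--deploy-mode cluster \
--num-executors 50 \
--executor-memory 8g \
--executor-cores 2 \
--conf spark.memory.fraction=0.6 \
--conf spark.memory.storageFraction=0.5 \
/usr/opt/modules/spark/jar/spark.jar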

Supplement: configuring the spark-submit script in a production environment

# --queue specifies the Yarn resource queue to submit to.
# spark.yarn.executor.memoryOverhead is the off-heap memory per Executor, in MB.
# spark.core.connection.ack.wait.timeout is how long (in seconds) a connection waits
# for an ack before timing out; raising it helps tasks survive long GC pauses.
bin/spark-submit \
--class com.atguigu.spark.WordCount \
--master yarn \
--deploy-mode cluster \
--num-executors 80 \
--driver-memory 6g \
--executor-memory 6g \
--executor-cores 3 \
--queue root.default \
--conf spark.yarn.executor.memoryOverhead=2048 \
--conf spark.core.connection.ack.wait.timeout=300 \
/usr/local/spark/spark.jar

Reference values for the parameter configuration:

--num-executors: 50~100
--driver-memory: 1G~5G
--executor-memory: 6G~10G
--executor-cores: 3
--master: must use yarn in an actual production environment