This series of blog posts summarizes and shares examples drawn from real business environments and provides practical guidance on Spark business applications. Stay tuned for the rest of the series. Copyright: this series of Spark business application posts belongs to the author (Qin Kaixin).

  • Qin Kaixin technology community – complete catalogue of the big data business practice series
  • Spark Business Application: In-depth analysis of Spark data skew case tests and tuning guidelines
  • Spark Business Application Deployment – In-depth analysis of Spark resource scheduling parameter tuning

1 Spark internal resource relationships

2 Optimizing the configuration of Spark running resources

./bin/spark-submit \
  --master yarn-cluster \
  --num-executors 100 \
  --executor-memory 6G \
  --executor-cores 4 \
  --driver-memory 1G \
  --conf spark.default.parallelism=1000 \
  --conf spark.storage.memoryFraction=0.5 \
  --conf spark.shuffle.memoryFraction=0.3

3 Setting Spark running resource parameters

  • num-executors
  • Parameter Description: This parameter sets the number of Executor processes used to run the Spark job. When the Driver requests resources from the YARN cluster manager, YARN starts the requested number of Executor processes on the cluster's worker nodes, as far as resources allow. This parameter is very important: if it is left unset, only a small number of Executor processes are started by default, and the Spark job will run very slowly.
  • Parameter tuning suggestions: Set 50 to 100 Executor processes for each Spark job; avoid setting too few or too many. With too few, cluster resources cannot be fully used; with too many, most queues cannot provide enough resources. (A minimal example is sketched below.)
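
A minimal sketch of the setting above (the main class com.example.MyApp and the jar name my-app.jar are placeholders, not from the article):

# Hypothetical example: request 50 executors, the low end of the suggested 50-100 range.
./bin/spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --num-executors 50 \
  my-app.jar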

  • executor-memory
  • Parameter Description: This parameter sets the memory for each Executor process. The size of Executor memory often determines Spark job performance and is directly related to common JVM OOM exceptions.
  • Parameter tuning suggestions: Set 4 GB to 8 GB of memory for each Executor process. This is only a reference value; the appropriate setting depends on your department's resource queue. Check the maximum memory limit of your team's resource queue: num-executors multiplied by executor-memory must not exceed it. In addition, if you share the resource queue with others on the team, the total memory requested should not exceed 1/3 to 1/2 of the queue's maximum; otherwise your Spark job would occupy all the resources in the queue and prevent your colleagues' jobs from running. (See the worked example below.)
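
As a hedged illustration of the queue rule (the 600 GB queue cap and the application names are hypothetical): 50 executors at 4 GB each request 200 GB in total, roughly 1/3 of the queue.

# Hypothetical shared queue capped at 600 GB of memory:
# 50 executors * 4 GB = 200 GB, about 1/3 of the queue,
# leaving memory for colleagues' jobs.
./bin/spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --num-executors 50 \
  --executor-memory 4G \
  my-app.jar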

  • executor-cores
  • Parameter Description: This parameter sets the number of CPU cores for each Executor process, which determines how many task threads each Executor can run in parallel. Because each CPU core can execute only one task thread at a time, the more CPU cores an Executor process has, the faster it can finish all of the task threads assigned to it.
  • Parameter tuning suggestions: Set 2 to 4 CPU cores per Executor. This also depends on your department's resource queue: check the queue's maximum CPU core limit and, based on the number of executors you set, work out how many cores each Executor process can be allocated. Likewise, if you share the queue with others, do not use more than 1/3 to 1/2 of the queue's total CPU cores, so that colleagues' jobs can still run. (See the sketch below.)
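
The same arithmetic for cores, as a sketch (the 400-core queue cap is hypothetical): 50 executors at 3 cores each use 150 cores, which falls between 1/3 and 1/2 of the queue.

# Hypothetical shared queue capped at 400 CPU cores:
# 50 executors * 3 cores = 150 cores, between 1/3 and 1/2 of the queue.
./bin/spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --num-executors 50 \
  --executor-cores 3 \
  my-app.jar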

  • driver-memory
  • Parameter Description: This parameter sets the memory of the Driver process.
  • Parameter tuning suggestions: The Driver's memory usually does not need to be set, or about 1 GB is enough. The one thing to note is that if the job uses the collect operator to pull all of an RDD's data back to the Driver for processing, the Driver memory must be large enough; otherwise an OOM (out-of-memory) error will occur. (An illustrative example follows.)
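
A minimal sketch for a job that collects a sizeable result set back to the Driver (the 4 GB figure is only illustrative, not a recommendation from the article):

# Most jobs can leave the Driver at about 1 GB; if the job calls collect()
# on a large RDD, give the Driver more memory, e.g. 4 GB.
./bin/spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --driver-memory 4G \
  my-app.jar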

  • spark.default.parallelism
  • Parameter Description: This parameter is used to set the default number of tasks per stage. This parameter is very important. If you do not set this parameter, it may directly affect your Spark job performance.
  • Parameter tuning suggestions: Set 500 to 1000 tasks by default for a Spark job. A mistake many people make is not setting this parameter at all. In that case, Spark derives the number of tasks from the number of underlying HDFS blocks, by default one task per HDFS block. This usually yields far too few tasks (for example, a few dozen), and if the number of tasks is too small, all of the Executor resources configured earlier are wasted. Imagine that no matter how many Executor processes, how much memory and how many cores you have, with only 1 or 10 tasks, 90% of your Executor processes may have no task to execute at all, wasting resources. Therefore, the Spark documentation recommends setting this parameter to 2 to 3 times num-executors * executor-cores. For example, with 300 Executor CPU cores in total, about 1,000 tasks is appropriate, which allows the Spark cluster's resources to be fully utilized. (The arithmetic is sketched below.)
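
A sketch using the article's own numbers (100 executors * 3 cores = 300 concurrent task slots, so roughly 1,000 tasks; the application names are placeholders):

# 100 executors * 3 cores = 300 task slots; 2-3x that is ~600-900,
# so ~1000 tasks keeps every slot busy across stages.
./bin/spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --num-executors 100 \
  --executor-cores 3 \
  --conf spark.default.parallelism=1000 \
  my-app.jar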

  • spark.storage.memoryFraction
  • Parameter Description: This parameter is used to set the percentage of RDD persistent data in Executor memory. The default value is 0.6. This means that 60% of the default Executor’s memory can be used to hold persistent RDD data. Depending on the persistence strategy you choose, if you run out of memory, data may not be persisted, or data may be written to disk.
  • Parameter tuning suggestions: If the Spark job has many RDD persistence operations, you can increase this value to ensure that persisted data fits in memory, avoiding the situation where, for lack of memory, the data can only be written to disk and performance suffers. However, if the job has many shuffle operations and few persistence operations, lower this value. In addition, if the job runs slowly because of frequent GC (the job's GC time can be observed in the Spark Web UI), it means the task does not have enough memory for executing user code, and it is likewise recommended to lower this value. (See the sketch below.)
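
A hedged sketch for a shuffle-heavy job with little caching (the 0.3 value is illustrative; note that the memoryFraction settings apply to Spark's legacy static memory manager):

# Few persisted RDDs, many shuffles: shrink the storage fraction
# below the 0.6 default so more memory is left for shuffle and user code.
./bin/spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --conf spark.storage.memoryFraction=0.3 \
  my-app.jar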

  • spark.shuffle.memoryFraction
  • Parameter Description: This parameter sets the proportion of Executor memory that a task can use for aggregation when it pulls the output of the previous stage's tasks during shuffle. The default value is 0.2; that is, by default only 20% of Executor memory is available for this operation. If the memory used during shuffle exceeds this 20% limit, the excess data is spilled to disk files, which greatly reduces performance.
  • Parameter tuning suggestions: If the Spark job has few RDD persistence operations and many shuffle operations, it is advisable to lower the memory fraction for persistence and raise the fraction for shuffle, so that shuffle does not run out of memory when handling large amounts of data and have to spill to disk, which degrades performance. Again, if the job runs slowly due to frequent GC, meaning the task lacks memory for executing user code, it is also recommended to lower this value. (See the sketch below.)
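
Conversely, a sketch that gives shuffle aggregation a larger slice while trimming storage (the exact values are illustrative):

# Raise the shuffle fraction above the 0.2 default and lower the storage
# fraction so the two still leave room for user code.
./bin/spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --conf spark.storage.memoryFraction=0.4 \
  --conf spark.shuffle.memoryFraction=0.3 \
  my-app.jar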

4 Summary

Qin Kaixin in Shenzhen