Spark parallelism refers to the number of tasks in each Stage of a Spark job, i.e., how many tasks of a Stage can run concurrently. A reasonable parallelism setting can be approached from the following aspects:

  • 1. Make full use of the task's resources: set the parallelism slightly higher than the total CPU resources allocated to the executors (= num-executors * cores per executor); a sketch follows this list.
  • 2. The average Partition should not be too small; around 100 MB per Partition is generally appropriate.
  • 3. Strike a balance between the resources allocated to the job and the amount of data it has to process.
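
As a rough sketch of point 1, the snippet below derives a target parallelism from hypothetical resource settings (the num-executors and executor-cores values here are placeholders, not taken from the text) and passes it to the job through spark.default.parallelism.

```scala
import org.apache.spark.sql.SparkSession

object ParallelismSizing {
  def main(args: Array[String]): Unit = {
    // Hypothetical resource allocation for this application.
    val numExecutors  = 10   // --num-executors
    val executorCores = 4    // --executor-cores
    val totalCores    = numExecutors * executorCores

    // Rule of thumb from the text: 2-3x the total number of CPU cores.
    val targetParallelism = totalCores * 3

    val spark = SparkSession.builder()
      .appName("parallelism-sizing-sketch")
      .master("local[*]")   // local run for illustration only
      .config("spark.default.parallelism", targetParallelism.toString)
      .getOrCreate()

    println(s"total cores = $totalCores, target parallelism = $targetParallelism")
    spark.stop()
  }
}
```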

1. Set the number of tasks to 2-3 times the total number of CPU cores of the Spark Application to improve the efficiency and speed of the job.
2. spark.default.parallelism has no default value; when it is set (e.g., to 10), it only takes effect on Shuffle operations. For example, with val rdd1 = rdd2.reduceByKey(_ + _), rdd1 will have 10 partitions, while the number of partitions of rdd2 is not affected by this parameter.
3. If the data is read from HDFS, increase the number of blocks: by default a split maps one-to-one to a block, and a split corresponds to a Partition of the RDD, so more blocks means higher parallelism.
4. The reduceByKey operator can specify the number of partitions explicitly, e.g., val rdd2 = rdd1.reduceByKey(_ + _, 10).
5. For val rdd3 = rdd1.join(rdd2), the number of partitions of rdd3 is determined by the parent RDD with the larger number of partitions, so increase the number of partitions of the parent RDDs when using the join operator.
6. Set spark.sql.shuffle.partitions to configure the number of partitions used during shuffles in Spark SQL.
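
The following Scala sketch illustrates points 2, 4, 5, and 6 above; the data, partition counts, and local master setting are invented for illustration, and the exact counts printed can vary with Spark version and configuration.

```scala
import org.apache.spark.sql.SparkSession

object PartitionExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-examples-sketch")
      .master("local[4]")                            // local run for illustration only
      .config("spark.sql.shuffle.partitions", "20")  // point 6: shuffle partitions for Spark SQL
      .config("spark.sql.adaptive.enabled", "false") // keep the SQL partition count deterministic
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

    // Point 4: reduceByKey with an explicit partition count.
    val reduced = pairs.reduceByKey(_ + _, 10)
    println(s"reduced partitions = ${reduced.getNumPartitions}")     // 10

    // Point 5: the join result follows the parent with the larger partition count.
    val other  = sc.parallelize(Seq(("a", "x"), ("b", "y")), numSlices = 6)
    val joined = reduced.join(other)
    println(s"joined partitions = ${joined.getNumPartitions}")       // 10

    // Point 6: spark.sql.shuffle.partitions controls shuffles in SQL / DataFrame jobs.
    import spark.implicits._
    val agg = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v").groupBy("k").sum("v")
    println(s"SQL shuffle partitions = ${agg.rdd.getNumPartitions}") // 20

    spark.stop()
  }
}
```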