Learning materials: Spark solutions to OOM problems and optimization summary

Contents

  • Set the shuffle reduce-side buffer size to avoid OOM
  • Resolve shuffle file pull failures caused by JVM GC
  • Resolve various serialization errors
  • Solve the network card traffic surge caused by yarn-client mode
  • Resolve JVM PermGen memory overflow in yarn-cluster mode

Set the shuffle reduce-side buffer size to avoid OOM

The buffer size determines how much data the reduce side can pull at a time, because the pulled data lands in the buffer first; the portion of executor heap memory reserved for aggregation (0.2, i.e. 20% of the heap by default) is then used to run the aggregation and your operator functions.

The default buffer size is 48MB. In most cases, a reduce-side task will not actually fill all 48MB; much of the time it pulls only about 10MB of data and is done.

Most of the time this is not a problem. Sometimes, however, the map side produces a very large amount of data and writes it very quickly, so when the reduce-side tasks pull it, every task fills its buffer to the 48MB limit. On top of that, the aggregation code your reduce side runs creates a large number of objects. At that point memory can suddenly run out and an OOM occurs: memory overflow on the reduce side.

In this case, reduce the size of the reduce-side task buffer, for example to 12MB. Data is then pulled in more, smaller batches; since each reduce-side task holds less data at a time, an OOM memory overflow is much less likely to occur.
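
The text above does not name the parameter, but in Spark this reduce-side pull buffer is controlled by spark.reducer.maxSizeInFlight (default 48m). A minimal sketch of lowering it to the 12MB suggested above (the app name is a placeholder):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: shrink the reduce-side pull buffer from the default 48m to 12m.
// spark.reducer.maxSizeInFlight is Spark's property for this buffer; "12m" is just
// the example value from the text, and the app name is a placeholder.
SparkConf conf = new SparkConf()
    .setAppName("ShuffleBufferTuning")
    .set("spark.reducer.maxSizeInFlight", "12m");
JavaSparkContext sc = new JavaSparkContext(conf);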

Resolve shuffle file pull failures caused by JVM GC

There is a situation that occurs very, very commonly in Spark jobs: shuffle file not found. Sometimes it is only an occasional occurrence; after it happens, the stage and its tasks are resubmitted, run again, and the error is gone.

For example, the executor's JVM process may run short of memory and perform a GC, either a minor GC or a full GC. In short, once the JVM goes into GC, all worker threads inside that executor stop, including the BlockManager and the netty-based network communication.

Meanwhile the executors of the next stage have not stopped, and their tasks try to pull their data from the executors of the previous stage. Because those executors are stuck in GC, the tasks fail to get their data for a long time.

The likely result is a shuffle file not found error. Yet when the next stage resubmits the stage or task, it often executes without any problem, because the second time around the JVM may not be in GC.

The solution parameters are as follows:

spark.shuffle.io.maxRetries 3

The first parameter is the maximum number of retries when pulling a shuffle file fails (the file will be re-pulled that many times); the default is 3.

spark.shuffle.io.retryWait 5s

The second parameter is the wait interval between retries of the pull; the default is 5 seconds.

With the defaults, suppose an executor of the first stage is in the middle of a long full GC. An executor of the second stage tries to pull a file, fails, and retries three times at five-second intervals, so it waits at most 3 * 5s = 15s. If no shuffle file is retrieved within those 15 seconds, it reports shuffle file not found.

For this situation, we can tune these parameters in advance. Increase both values substantially, so that a task of the second stage is all but guaranteed to be able to pull the output files of the previous stage and the shuffle file not found error is avoided. Otherwise stages and tasks may be resubmitted and re-executed, which is also bad for performance.

spark.shuffle.io.maxRetries 60
spark.shuffle.io.retryWait 60s

With these values the job can tolerate up to an hour (60 retries * 60s) without getting the shuffle file. Just set them as large as reasonably possible; a full GC will not last an hour.

In this way, the shuffle file not found errors caused by GC can be avoided.
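
As an alternative to putting the two lines above into spark-defaults.conf or onto the spark-submit command line, here is a minimal sketch of setting them on a SparkConf in code; the values simply mirror the ones above:

import org.apache.spark.SparkConf;

// Sketch: raise the shuffle pull retries and the wait between retries,
// using the same values as in the text above.
SparkConf conf = new SparkConf()
    .set("spark.shuffle.io.maxRetries", "60")
    .set("spark.shuffle.io.retryWait", "60s");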

Resolve various serialization errors

Submit the Spark job in client mode and look at the logs printed locally. If errors mentioning Serializable, Serialize and the like appear, a serialization problem has occurred. Although it is an error, serialization errors are relatively simple and very easy to handle.

Three points to note when a serialization error is reported:

  1. If you use a variable of an external custom type inside your operator function, that custom type must be serializable:
final Teacher teacher = new Teacher("leo");

studentsRDD.foreach(new VoidFunction<Row>() {
    public void call(Row row) throws Exception {
        String teacherName = teacher.getName();
        // ...
    }
});

public class Teacher implements Serializable {

}
  2. If a custom type is used as the element type of an RDD, that custom type must also be serializable:
JavaPairRDD<Integer, Teacher> teacherRDD;
JavaPairRDD<Integer, Student> studentRDD;
studentRDD.join(teacherRDD);

public class Teacher implements Serializable {
  
}

public class Student implements Serializable {
  
}
  3. In either of the above cases, you cannot use a third-party type that does not support serialization:
Connection conn = ...;

studentsRDD.foreach(new VoidFunction<Row>() {
    public void call(Row row) throws Exception {
        // conn. ...
    }
});

Connection objects do not support serialization, so this fails.
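
The text stops at pointing out the problem. A common workaround, offered here only as a sketch and not taken from this article, is to create the connection inside the operator, typically once per partition, so the non-serializable object is never captured by the closure and never serialized (jdbcUrl is a placeholder):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Iterator;

import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.Row;

// Sketch: build the non-serializable Connection inside the function, once per
// partition, so it never travels with the closure. jdbcUrl is a placeholder.
studentsRDD.foreachPartition(new VoidFunction<Iterator<Row>>() {
    public void call(Iterator<Row> rows) throws Exception {
        Connection conn = DriverManager.getConnection(jdbcUrl);
        try {
            while (rows.hasNext()) {
                Row row = rows.next();
                // write row using conn ...
            }
        } finally {
            conn.close();
        }
    }
});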

Solve the network card traffic surge caused by yarn-client mode

What problems may occur in yarn-client mode?

Since our driver is started on the local machine and is fully responsible for scheduling all tasks, it communicates frequently with the many executors running in the YARN cluster: task launch messages, task execution statistics, task status updates, shuffle output results, and so on.

Imagine you have 100 executors and 10 stages, each with 1,000 tasks. Every stage submits its 1,000 tasks to the executors, an average of 10 tasks per executor, and the driver has to communicate frequently with those 1,000 running tasks: very many messages at a very high frequency. As soon as one stage finishes, the next stage runs and the frequent communication starts again.

Throughout the lifetime of the Spark job, communication and scheduling are frequent, and all of it is sent and received by your local machine. That is the killer: during the, say, 30 minutes the Spark job runs, your local machine handles a lot of heavy, frequent network traffic, its network load becomes very high, and its network card traffic surges.

A surge in network card traffic on your local machine is certainly not a good thing. In some large companies the usage of every machine is monitored, and a single machine is not allowed to consume large amounts of network bandwidth, or the operations staff will get involved. It could also have a negative impact on the company's network, or on other machines (your submit machine may well be a virtual machine).
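
The fix is not spelled out in this excerpt, so treat the following as an assumption: the commonly used mitigation is to submit production jobs in yarn-cluster mode, so the driver and all of this scheduling traffic live on a node inside the YARN cluster rather than on your local machine. A minimal sketch (class and jar names are placeholders):

# Sketch only: run the driver inside the cluster instead of on the submit machine.
# com.example.MySparkApp and my-spark-app.jar are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkApp \
  my-spark-app.jar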

Resolve JVM PermGen memory overflow in yarn-cluster mode

Practical experience: a problem encountered in yarn-cluster mode.

Sometimes a Spark job that contains Spark SQL can be submitted and run normally in yarn-client mode, but in yarn-cluster mode the submission fails with a JVM PermGen (permanent generation) memory overflow, an OOM.

In yarn-client mode the driver runs on the local machine, and the PermGen setting of the JVM Spark uses comes from the local spark-class file (the Spark client's default configuration), which gives the JVM a permanent generation of 128MB. That is enough. In yarn-cluster mode, however, the driver runs on a node of the YARN cluster and uses the default configuration, a PermGen permanent generation of only 82MB.

Spark SQL involves complex SQL semantic parsing, syntax tree transformation and so on. If your SQL itself is very complicated, this can be costly in both performance and memory, and the PermGen permanent generation may end up with a large footprint.

So if the permanent generation needs more than 82MB but less than 128MB, then yarn-client mode, whose default is 128MB, still works, while yarn-cluster mode, whose default is 82MB, runs into trouble and reports a PermGen Out of Memory error.

How to solve this problem?

Since it is the JVM's PermGen permanent generation that is running out of memory, simply give the driver more PermGen in yarn-cluster mode.

Add the following configuration to the spark-submit script:

--conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M"

This sets the driver's permanent generation to 128MB initially, with a maximum of 256MB. With this setting it is almost guaranteed that your Spark job will not hit the yarn-cluster-mode permanent generation OOM described above.
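
For context, a minimal sketch of where that line sits in a full spark-submit script; only the --conf line comes from the text above, while the class and jar names are placeholders:

# Only the --conf line is from the article; class and jar names are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkSQLApp \
  --conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M" \
  my-spark-app.jar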