This article was authorized for release on the NetEase Cloud Community by its author, Yue Meng.




Logs are the key to locating faults, and Spark is no exception. Properly configuring and collecting the Spark driver, AM, and executor logs makes troubleshooting much more efficient. The relevant Spark configurations are described below.

1) Driver log configuration: the path of a log4j.properties file can be passed through spark.driver.extraJavaOptions, for example:

spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/home/hadoop/ym/spark-1.6.1-bin-hadoop2.6/conf/log4j.properties

log4j.properties:

# Set everything to be logged to the console
log4j.rootCategory=INFO, console,rolling
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} - [%p] - [%l] %m%n
log4j.appender.rolling.Append=true
log4j.appender.rolling.Encoding=UTF-8
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.MaxFileSize=200MB
log4j.appender.rolling.file=/home/hadoop/ym/logs/${spark.app.name}.driver.log
log4j.logger.org.apache.spark=INFO
log4j.logger.org.eclipse.jetty=WARN

With this configuration, the driver log of your application is written to the ${spark.app.name}.driver.log file under the directory specified above (/home/hadoop/ym/logs in this example).
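As a quick illustration, the driver setting can also be passed on the command line at submit time. The sketch below is shell only, uses the paths from this article, uses a placeholder application class and jar, and assumes log4j.properties exists at that path on the host where the driver starts:

# Minimal sketch: point the driver's log4j at the custom properties file.
# com.example.MyApp and MyApp.jar are placeholders for your own application.
spark-submit \
  --master yarn \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/home/hadoop/ym/spark-1.6.1-bin-hadoop2.6/conf/log4j.properties" \
  --class com.example.MyApp \
  MyApp.jar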

2) Executor log configuration: the path of a log4j.properties file can be passed through spark.executor.extraJavaOptions, for example:

spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/home/hadoop/ym/spark-1.6.1-bin-hadoop2.6/conf/log4j-executor.properties
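Note that with a file: URL, the properties file must already exist at that path on every NodeManager host. A commonly used alternative (a sketch only, not the setup used in this article; com.example.MyApp and MyApp.jar are placeholders) is to ship the file with --files, in which case it lands in each container's working directory and can be referenced by bare file name:

# Sketch: distribute the log4j config with the job instead of pre-installing it on every node.
spark-submit \
  --master yarn \
  --files /home/hadoop/ym/spark-1.6.1-bin-hadoop2.6/conf/log4j-executor.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
  --class com.example.MyApp \
  MyApp.jar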

log4j-executor.properties:

# Set everything to be logged to the console
log4j.rootCategory=INFO,rolling
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} - [%p] - [%l] %m%n
log4j.appender.rolling.Append=true
log4j.appender.rolling.Encoding=UTF-8
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.MaxFileSize=2MB
log4j.appender.rolling.file=${spark.yarn.app.container.log.dir}/stdout
log4j.logger.org.apache.spark=DEBUG
log4j.logger.org.eclipse.jetty=WARN

Executor logs come in two kinds, stored in different places: the logs written while the application is running and the logs aggregated after it ends. The runtime log files are stored under ${yarn.nodemanager.log-dirs}/${ApplicationID}, where yarn.nodemanager.log-dirs is configured in YARN, for example:

<property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>file:/mnt/ddb/2/hadoop/nm</value>
</property>

On one of the nodes, the runtime executor logs for this application are therefore found under:

root@xxx:/mnt/ddb/2/hadoop/nm/application_1471515078641_0007# ls

container_1471515078641_0007_01_000001 container_1471515078641_0007_01_000002 container_1471515078641_0007_01_000003

container_1471515078641_0007_01_000001 is the first container that the RM assigned to application_1471515078641_0007, that is, the container in which the AM runs, because the first container is always used to start the AM. The containerID has the form container_APPID_01_000001; if you search the RM log for container_APPID, you can see how the containers for that application were assigned and trace their life cycles.
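For example, this can be traced roughly from the command line (a sketch; the Hadoop daemon log paths below are hypothetical and depend on your installation):

# Hypothetical RM/NM log locations; adjust to where your cluster writes Hadoop daemon logs.
# Which containers were assigned to the application, and to which nodes:
grep "container_1471515078641_0007" /path/to/hadoop/logs/*resourcemanager*.log
# On the node that received a container, trace that container's life cycle:
grep "container_1471515078641_0007_01_000002" /path/to/hadoop/logs/*nodemanager*.log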

After the application finishes, the logs are aggregated to HDFS under /tmp/logs/${user}/logs, for example:

drwxrwx---   - root supergroup          0 2016-08-18 18:29 /tmp/logs/root/logs/application_1471515078641_0002
drwxrwx---   - root supergroup          0 2016-08-18 19:10 /tmp/logs/root/logs/application_1471515078641_0003
drwxrwx---   - root supergroup          0 2016-08-18 19:17 /tmp/logs/root/logs/application_1471515078641_0004
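If log aggregation is enabled on the cluster, the same aggregated logs can also be pulled back with the yarn CLI instead of reading the HDFS directories directly, for example:

# Fetch the aggregated logs of a finished application.
yarn logs -applicationId application_1471515078641_0007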

3) AM log configuration: the path of a log4j.properties file can be passed through spark.yarn.am.extraJavaOptions, for example:

spark.yarn.am.extraJavaOptions=-Dlog4j.configuration=file:/home/hadoop/ym/spark-1.6.1-bin-hadoop2.6/conf/log4j-executor.properties

Since both the AM and the executors run inside containers, the AM can simply reuse the executor's log4j-executor.properties configuration shown above.

A few additional notes:

The notes below focus on how to troubleshoot Spark problems through the logs, that is, on knowing which log to look at for each kind of failure. For an overview of the processes involved in running a Spark application, see http://ks.netease.com/blog?id=5174; each of those processes corresponds to a different log file.

1. Troubleshooting faults during SparkContext (sc) initialization

1) If the driver fails to submit the application to the RM, check the driver log first, then search the RM log by APPID to trace the application's life cycle.

2) If the problem occurs while the RM is starting the AM, first check the RM log to see which NM the application's first container was dispatched to, then go to that NM and trace the container's life cycle by its containerID.

3) If the AM has started but fails while requesting containers from the RM, check the AM log (that is, the log of the first container) to see which containers were requested and to which nodes they were dispatched, then go to those nodes' NM logs and trace each container by its containerID (a sketch of where these logs live on disk follows this list).

4) If the executors have started but cannot register with the driver, start from the executors' own container logs (stderr/stdout) on the corresponding NMs.
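As referenced in item 3), here is a minimal sketch of looking at the AM's logs on disk, assuming the yarn.nodemanager.log-dirs value used earlier in this article (which files actually appear depends on your setup):

# The AM is the first container (..._000001); its logs live in that container's directory
# under yarn.nodemanager.log-dirs on the node the RM dispatched it to.
ls /mnt/ddb/2/hadoop/nm/application_1471515078641_0007/container_1471515078641_0007_01_000001/
# With the log4j config above, the AM/executor log is written to the container's stdout file.
less /mnt/ddb/2/hadoop/nm/application_1471515078641_0007/container_1471515078641_0007_01_000001/stdout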

2. Troubleshooting faults after an sc action is triggered

1) First check the driver log to see which job failed, then drill down from the failed job to its stages and from the failed stage to its tasks to find which tasks failed and on which executor they ran. Then locate that executor's log and trace the task's life cycle by its task ID (see the grep sketch after this list).

2) For executor-lost problems, besides the log of the lost executor itself, also check the NM log for that executor's containerID to see the container's life cycle.

3) For failures when writing to HDFS, check the NM logs first, then follow the clues there to the corresponding DataNode logs.
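A rough sketch of the search order in item 1), with hypothetical paths, IDs, and search patterns used purely for illustration:

# 1. In the driver log, look for failed jobs/stages/tasks and note the executor/container they ran on.
grep -iE "failed|exception" /home/hadoop/ym/logs/myapp.driver.log
# 2. In that executor's container log (written to stdout by the log4j config above),
#    trace the task by its ID.
grep "task 3.0 in stage 5.0" /mnt/ddb/2/hadoop/nm/application_1471515078641_0007/container_1471515078641_0007_01_000002/stdout
# 3. If the executor itself was lost, trace its container in the NM log as shown earlier.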

Some of the above may be limited by my own understanding; please bear with me.

Finally, troubleshooting is one of the best ways to learn: working through a fault deepens your understanding of the underlying principles, and the more problems you handle, the faster you learn.







