From @twt community, by Meng Yang.

1. Problem description

At present, we preprocess source data files from upstream by writing Hadoop MapReduce program. After the source data file is sent to the Hadoop cluster, our preprocessor will perform such operations as encoding conversion, data de-duplication, time-adding zipper, data cleaning, and error data processing on the source data to generate ODS layer data attached to the source for upper modeling use.

The system runs stably all the time without any problems. However, for a period of time, the pre-processing jobs of some source files have frequently been stuck for a long time, which leads to the Hadoop cluster resources being occupied for a long time. Other jobs cannot be started normally due to insufficient resources, which affects the timeliness of the pre-processing. To this we have carried on the analysis and the processing from various aspects, the specific process is as follows.

2. Problem analysis

2.1. Possibility analysis of data skew

When problems occur, we first consider that the problem of data skew may lead to the phenomenon of long time stuck work. So let’s first examine the YARN console’s job information as shown in Figure 2-1:

From the figure above, we can see that a large number of reduce tasks are running for a long time, rather than a small number of reduce tasks. Data skew typical phenomenon is that most of the reduce task execution time is shorter, only a few reduce task to run for a long time, can say above reflect the situation is not completely consistent with the data skew, at the same time, we also for MR programs dealing with the source data files were grouped according to the Key value calculation, found that did not have the data distribution is not balanced, Therefore, we rule out the possibility that data skew could cause the job to get stuck.

2.2. Hadoop cluster component state analysis

By checking the status of each component on the UI page of Hadoop cluster and checking the log information of system services such as NameNode, DataNode and ResouceManager, we can confirm that the cluster and each component have no problems and run normally, so as to eliminate the problem of the cluster itself causing the job stuck.

2.3. Log analysis

We found that these long-running jobs were all stuck in the Reduce phase, and a large number of Reduce task cards stopped running at 27%-30% of the schedule, as shown in Figure 2-1. Some map tasks and reduce tasks also failed. So we looked at the following log information for this job:

1) Job logs obtained via yarn-logs:

2) The container log corresponding to Job is as shown in Figures 2-2 and 2-3

Figure 2-2 Exception job container information 1

Figure 2-3 Exception job container information 2

3) Failed map task log:

Figure 2-4 Failure Map Task Log

4) Failed reduce task log:

Figure 2-5. Failure Reduce Task Log

5) syslog log for a reduce task that is stuck for a long time:

Syslog. shuffle log of reduce task that is stuck for a long time:

7) The process stack of reduce tasks that have been stuck for a long time

According to the above logs, the following analysis is made:

Exception 1: (1) the job log (communication thread] org.. Apache hadoop. Mapred. Task: Communicationexception. This is a status reporting mechanism for tasks, and there will be a retry mechanism when an exception occurs (default is 3 times). If there is no abnormality in the follow-up, it means that it has returned to normal and there is no need for attention.

(2) anomalies in the job log 2: [communication thread] org.. Apache hadoop. Yarn. Util. ProcfsBasedProcessTree: Error reading the streamjava. IO. IOException: not the process. After the abnormal message appeared, the task continued to be stuck for 11 hours, and it was still in RUNNING state on the task interface. The reason is unknown.

(3) The log of failed map task and failed reduce task shows that the task timeout, the container is killed, and the corresponding process exit code is 143. A task shown in the UI log screenshot is not responsive for more than 10 minutes (corresponding to mapreduce.task.timeout=10 minutes in the yarn container configuration) and is killed. This situation usually occurs in the scenario of high concurrent tasks and intense IO competition. Consider optimizing parameters such as timeout time, container memory configuration, etc. However, this problem is not the cause of the job stalling, and it would not happen if the extremely long reduce task could be killed over time.

(4) For a long time stuck reduce task, the syslog log “IOException: There is no “abnormal process, usually read/proc/meminfo, the local files such as/proc/cpuinfo, under normal circumstances any user can access these files, can be diagnosed by no more file descriptors to open the corresponding process under the/proc/file information. Through the “ulimit-n” query, it is found that the number of open files corresponding to all nodes and users is 100000, which may be a little low when the system is busy. It will be considered to increase to 150000 later. But that’s not what caused the job to get stuck.

(5) By looking at the syslog. Shuffle log of the reduce task that has been stuck for a long time, it can be found that the corresponding shuffle task has processed a large number of map results in a relatively short time, but no error was reported in the log of the shuffer task itself, showing that the task is stuck. The reason is unknown.

(6) Since the above log did not find the clue that the task was stuck for a long time, we also checked the stack information of the stuck reduce task process. From the stack information, we found that the thread state of the reduce task obtaining the completion event of the map task was blocked, that is to say, the reduce task was waiting for the signal of the completion of the map task but did not receive it. The fact that all of the job’s map tasks have been completed causes the entire job to get stuck, which can occasionally happen under high concurrency. The reason why reduce does not trigger the timeout kill mechanism is that there is no abnormality in the ping heartbeat sending of reducetask, and the event acquisition thread does not exit (if exit due to abnormality, it will directly lead to the abnormality of reduceTask and rerun the task). We further checked the following three parameters of job:

Since the network timeout time of TCP based Socket is set, when SocketTimeoutException occurs when the task node reads data, it will automatically send ping packets to the server to test whether the current connection between the client and the server is normal. The corresponding parameter is (default is 60000ms). Ipc.client.rpc-timeout. ms is the value of the timeout that the RPC client waits for. If the value of the parameter is 0, the timeout will not occur if the remote method call does not receive any data, as long as the ping service is normal. If you do not receive the data from the remote method within this time, you will timeout and stop sending the ping service. Thus, you can effectively avoid the death of the ping service. The current cluster ping mechanism is enabled, and it sends ping service to the server regularly every 1 minute. However, the timeout is set to 0, that is, it will never timeout, and it will always be in the stage of reading map data. In other words, AM always thinks that the reduce task is still alive and does not kill it according to the task timeout mechanism. The core cause of the problem has been found.

2.4. Peripheral situation analysis of job operation

While analyzing the logs, we also looked at other peripheral situations in which jobs were stuck:

1) A large number of jobs are adjusted during this period;

2) Recently, the cluster has expanded capacity and is undergoing data balancing operation. The IO competition among nodes is fierce;

3) In view of Hadoop’s task allocation principle (the principle that local data comes first and calculation is carried out on the node where the data is located), during the peak period of the task, the IO competition of some nodes is fierce, which will lead to the task timeout phenomenon.

2.5. Analysis summary

To sum up, the following conclusions can be drawn:

1) IPC.Client-rPC-Timeout. ms=0 parameter setting is not reasonable, there is no timeout exit mechanism, resulting in high concurrency, IO competition in the scene, triggering the problem of long time stuck task;

2) The number of node file descriptors is too low, resulting in the “IOException: no process” exception in the task execution;

3) Parameters such as timeout time and container memory configuration are not optimized enough, resulting in some tasks failing to timeout and exit, affecting the execution of Job;

4) A large number of job concurrency and data balancing operations to some extent aggravate the IO competition degree, and indirectly trigger the occurrence of the problem of stuck.

Among them, 1) is the main factor causing the problem.

3. Problem solving

Once the cause of the problem is found, it is easy to make a solution. We solve the problem by controlling job concurrency, reducing the bandwidth of data balancing operation and adjusting the following cluster parameters:

After the above operation is performed, the problem is resolved.

After this problem processing, we also summarized a set of solutions to the problem of long execution of Hadoop program:

1. First check whether there is data skew, because in most cases, this is the cause of long execution of Hadoop programs. Typical phenomenon of data slanting is that most reduce tasks have short execution time and only a few reduce tasks run for a long time. At the same time, the data corresponding to one key is much more than the data corresponding to other keys. Once the above situation occurs, it can be judged that data slanting occurs and the data with a large number of key values needs to be processed separately. For example, random numbers are added to the key to scatter the data, so that the data can be shuffled to all reduce nodes on an average basis as far as possible to make full use of the computing power of each node.

2. If there is no data skewing, it is necessary to first check the operation of the main components of the cluster, such as HDFS, YARN, Spark, Hhive, etc., to confirm whether the components are normal. A considerable number of Hadoop programs have long execution problems caused by component operation problems. In general, the status of each component on the UI page of the Hadoop cluster and the system log of each component can be viewed to determine whether there is a problem with the component running.

3. If the above two items are not a problem, we need to do a more detailed analysis of the logs, focusing on the job logs, container logs, map and reduce node logs, and map and reduce task process stack, especially the task process stack is very helpful for the analysis and solution of some difficult problems. Generally, there are many reasons to cause the timeout problem in this step, which needs to be analyzed on a case-by-case basis. However, we might as well focus on whether the relevant timeout parameter is caused by unreasonable setting, such as the following parameters

The above thinking is only for reference, I hope to help you solve the problem.