NodeManager1's CPU is overloaded. The process is still running but no longer sends heartbeat messages to the ResourceManager, while it keeps repeating the remove attempts described in step 2 below. After the heartbeat has been lost for a while, the NodeManager reconnects to the RM, but CPU usage stays high; after another period the heartbeat stops again, and the cycle repeats continuously.

1. Localizing: the container starts downloading resources from HDFS, and the resource status changes from INIT to DOWNLOADING.

2018-08-25 16:15:38,592 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://mycluster/user/hdfs/.staging/application_1444990016246_29569/libjars/avro-mapred-hadoop2.jar transitioned from INIT to DOWNLOADING

2. The container is stopped or killed during localization, so the resource stays in the DOWNLOADING state. Every removal attempt is then rejected with a "non-zero refcount" error even though no other container is actually using the resource (as the remove() code in the solution section shows, removal is refused while the resource is still DOWNLOADING or its reference count is non-zero), so the resource cannot be deleted.

2018-08-25 19:15:38,592 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Attempt to remove resource: { { hdfs://mycluster/user/hdfs/.staging/application_1444990016246_29569/libjars/avro-mapred-hadoop2.jar, 1448139497492, FILE, null },pending,[],920074451410033,DOWNLOADING} with non-zero refcount

3. The task has been killed, so a CancellationException is reported.

2018-08-25 19:25:34,592 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: {… } failed; java.util.concurrent.CancellationException

4. After a period of time, the status changes from DOWNLOADING to FAILED, and the resource can then be deleted.

2018-08-25 20:15:38,592 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://mycluster/user/hdfs/.staging/application_1444990016246_29569/libjars/avro-mapred-hadoop2.jar(->/data/nm-local-dir/usercache/hadoop/filecache/5432524/avro-mapred-hadoop2.jar) transitioned from DOWNLOADING to FAILED

5. The locally cached file (possibly corrupted) is deleted.

2018-08-25 19:15:38,592 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Removed /data/nm-local-dir/usercache/hadoop/filecache/5432524/avro-mapred-hadoop2.jar from localized cache

6. The requested resource is no longer in the cache and will be requested again.

2018-08-25 19:15:38,592 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_152345432_4324_3_4324234 sent RELEASE event on a resource request {hdfs://mycluster/user/hdfs/.staging/application_1444990016246_29569/libjars/avro-mapred-hadoop2.jar,,,} not present in cache

Cause summary

The container is stopped because an RPC between the container and an external component fails, or because the task is killed. The resource then cannot be deleted normally, and the NodeManager keeps retrying the deletion.

The solution

1. Low-end approach: manually delete the HDFS files that cannot be removed (hard to do in practice: it is not obvious which files are stuck, and the operation is tedious when there are many of them); see the sketch below.
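As an illustration only, here is a minimal sketch of such a manual cleanup through the Hadoop FileSystem API, assuming you have already identified which staging file is stuck; the cluster URI and path below are simply taken from the log example above:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteStuckResource {
  public static void main(String[] args) throws Exception {
    // Assumption: the stuck file is known in advance; this is only the path from the logs above.
    Path stuck = new Path("hdfs://mycluster/user/hdfs/.staging/"
        + "application_1444990016246_29569/libjars/avro-mapred-hadoop2.jar");

    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf)) {
      if (fs.exists(stuck)) {
        // false = non-recursive delete, since this is a single file
        boolean deleted = fs.delete(stuck, false);
        System.out.println("Deleted " + stuck + ": " + deleted);
      } else {
        System.out.println(stuck + " does not exist, nothing to do");
      }
    }
  }
}
```

The hard part, as noted above, is knowing which files to delete in the first place; this snippet only covers the deletion itself.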

2. High-end approach: fix the code where the exception occurs, LocalResourcesTrackerImpl (around line 339):

public boolean remove(LocalizedResource rem, DeletionService delService) {
  // current synchronization guaranteed by crude RLS event for cleanup
  LocalizedResource rsrc = localrsrc.get(rem.getRequest());
  if (null == rsrc) {
    LOG.error("Attempt to remove absent resource: " + rem.getRequest()
        + " from " + getUser());
    return true;
  }
  if (rsrc.getRefCount() > 0
      || ResourceState.DOWNLOADING.equals(rsrc.getState()) || rsrc != rem) {
    // internal error
    LOG.error("Attempt to remove resource: " + rsrc
        + " with non-zero refcount");
    return false;
  } else { // ResourceState is LOCALIZED or INIT
    localrsrc.remove(rem.getRequest());
    if (ResourceState.LOCALIZED.equals(rsrc.getState())) {
      delService.delete(getUser(), getPathToDelete(rsrc.getLocalPath()));
    }
    decrementFileCountForLocalCacheDirectory(rem.getRequest(), rsrc);
    LOG.info("Removed " + rsrc.getLocalPath() + " from localized cache");
    return true;
  }
}

The ResourceState.DOWNLOADING.equals(rsrc.getState()) check is what turns a file stuck in DOWNLOADING into this error; that condition can be deleted from the source code. For reference patches see: issues.apache.org/jira/browse… issues.apache.org/jira/secure…
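As a minimal sketch only, assuming the fix is simply to drop the DOWNLOADING condition from the guard (the actual upstream patches in the JIRA links above may differ), the relevant part of remove() would become:

```java
// Sketch, not the upstream patch: the DOWNLOADING check is dropped, so a resource
// stuck in DOWNLOADING with a reference count of 0 can now be removed as well.
if (rsrc.getRefCount() > 0 || rsrc != rem) { // internal error
  LOG.error("Attempt to remove resource: " + rsrc
      + " with non-zero refcount");
  return false;
}
// ... the rest of remove() stays unchanged ...
```

Note that with this change a resource that is still genuinely downloading would also pass the guard, which is why the JIRA discussions are worth reading before applying it.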

3. Invincible approach: the good old restart. Restarting the NodeManager lets Spark automatically fail over the affected tasks, so online services are not affected.

Conclusion:

The problem has nothing to do with the container's resource allocation or usage, because it is the NodeManager's CPU that is high, not the container's. The cause is this: when a task is submitted, the container starts initializing and pulls its dependency resources from HDFS to the local node. If the task or the container dies at that point (killed manually or timed out) and no other container is using the resource, the resource is left in the DOWNLOADING state and the error shown in step 2 is logged. Normally this error can be ignored: after a while the state changes from DOWNLOADING to FAILED and the resource is simply deleted. In this case, however, I observed far too many files stuck in DOWNLOADING; the state transition was very slow or never completed at all, so the files could not be deleted, the log filled up with errors like the one in step 2, and the CPU was driven high.

If I cannot reply to comments in time, you can ask questions or discuss directly through my public account; I will answer whatever I know. Thanks.