This article is participating in the Java Theme Month – Java Debug Notes Event; see the event page for details.

1 Overview

An interesting finding: with limited resources, it is easy to run into a lot of problems. For this test I built a small Hadoop environment with very few resources: a single 2C4G machine.

2 GC Exception

  • Java heap space
    • A heap overflow occurred during the map phase
    • This usually happens when Xmx is too small and neither the young generation nor the old generation has room for newly created objects

  • So I printed the GC log (one way to enable such logging is sketched after the log below)
    • In the end there were several Full GC events, yet the heap was almost empty, only about 2% used
    • PS: while writing this article I noticed that every entry is an Allocation Failure, meaning something could never be fitted in
      • Allocation Failure means the JVM failed to allocate memory for a new object
[GC (Allocation Failure) [PSYoungGen: 14848K->2038K] 14848K->2606K(55296K), 0.0026520 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
[GC (Allocation Failure) [PSYoungGen: 16886K->2041K(31744K)] 17454K->4037K(70144K), 0.0034995 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[... several similar minor GC (Allocation Failure) and GC (Metadata GC Threshold) entries, garbled in the original ...]
[Full GC (Metadata GC Threshold) [PSYoungGen: 2021K->0K(61440K)] [ParOldGen: 10897K->8623K(35328K)] 12918K->8623K(96768K), [Metaspace: 20839K->20839K(1069056K)], 0.0277290 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
[... more minor GC (Allocation Failure) entries, garbled in the original ...]
[Full GC (Allocation Failure) [PSYoungGen: 4003K->0K(135168K)] [ParOldGen: 8631K->8546K(39424K)] 12634K->8546K(174592K), [Metaspace: 28225K->28225K(1075200K)], 0.0297214 secs] [Times: user=0.04 sys=0.00, real=0.03 secs]
[GC (Allocation Failure) [PSYoungGen: 0K->0K(135168K)] ... 0.0005536 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[Full GC (Allocation Failure) [PSYoungGen: 0K->0K(135168K)] [ParOldGen: ...] [Metaspace: 28225K->28214K(1075200K)], ... secs] [Times: user=0.07 sys=0.00, real=0.04 secs]
Heap
 PSYoungGen      total 135168K, used 9416K [0x00000000f5580000, 0x0000000100000000, 0x0000000100000000)
  eden space 128512K, 7% used [0x00000000f5580000,0x00000000f5eb21d8,0x00000000fd300000)
  from space 6656K, 0% used [0x00000000fd300000,0x00000000fd300000,0x00000000fd980000)
  to   space 7168K, 0% used [0x00000000ff900000,0x00000000ff900000,0x0000000100000000)
 ParOldGen       total 349696K, used 8057K [0x00000000e0000000, 0x00000000f5580000, 0x00000000f5580000)
  object space 349696K, 2% used [0x00000000e0000000,0x00000000e07de4f8,0x00000000f5580000)
 Metaspace       used 28262K, capacity 28534K, committed 28928K, reserved 1075200K
  class space    used 3154K, capacity 3259K, committed 3328K, reserved 1048576K
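For reference, logs in this format (including the "Heap before GC" dumps shown later) can be produced on JDK 8 by adding HotSpot flags to the task JVM options. A sketch of my own in mapred-site.xml; the output goes to the task's stdout log:

  <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx512M -verbose:gc -XX:+PrintGCDetails -XX:+PrintHeapAtGC</value>
  </property>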

3 Key Configuration

  • etc/hadoop/yarn-site.xml
    • The cluster's resources are quite small
    • This minimum-allocation setting is probably why, in earlier debugging, the task got stuck at map 0%
    • Checking the log showed that the AppMaster had requested 4G of memory and could never get it (one way to cap that request is sketched after the config below)
  <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>3172</value>
  </property>
  <!-- Minimum memory allocation per container. Default: 1024 -->
  <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
  </property>
  <!-- Maximum memory allocation per container. Default: 8192 -->
  <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>3172</value>
  </property>
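About that 4G request: the MRAppMaster's container size and heap come from two mapred-site.xml properties, yarn.app.mapreduce.am.resource.mb (default 1536) and yarn.app.mapreduce.am.command-opts (default -Xmx1024m), and a request larger than yarn.scheduler.maximum-allocation-mb can never be granted. A sketch of capping the AM to fit this node (values are illustrative):

  <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>1024</value>
  </property>
  <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx768m</value>
  </property>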
  • etc/hadoop/mapred-site.xml
    • This is where the configuration that caused the OOM lives
  <!-- Max memory for a map task. Default: 1536 (1.5G) -->
  <property>
        <name>mapreduce.map.memory.mb</name>
        <value>512</value>
  </property>
  <!-- JVM options for a map task. Default: -Xmx1024m -->
  <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx512M</value>
  </property>

  <!-- Max memory for a reduce task. Default: 3072 (3G) -->
  <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>1024</value>
  </property>
  <!-- JVM options for a reduce task. Default: -Xmx2560m -->
  <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx1024M</value>
  </property>

  <!-- Sort buffer size in MB. Upstream default: 100 (see section 6) -->
  <property>
        <name>mapreduce.task.io.sort.mb</name>
        <value>512</value>
  </property>

4 Locating the Problem

  • I modified mapred-site.xml to raise the map task's memory and heap (Xmx) to 1024
  • The problem then showed up in the GC log (I was not yet very familiar with GC troubleshooting at this point)
    • ParOldGen total 560128K, used 532981K
    • 512 MB is 524288K, which accounts for almost all of the 532981K in use
    • So some single object had requested roughly 512 MB of memory... (a minimal reproduction is sketched after the log below)
{Heap before GC invocations=7 (full 1):
 PSYoungGen      total 61440K, used 59392K [0x00000000eab00000, 0x00000000f2c00000, 0x0000000100000000)
  eden space 59392K, 100% used [0x00000000eab00000,0x00000000ee500000,0x00000000ee500000)
  from space 2048K, 0% used [0x00000000ee500000,0x00000000ee500000,0x00000000ee700000)
  to   space 6656K, 0% used [0x00000000f2580000,0x00000000f2580000,0x00000000f2c00000)
 ParOldGen       total 35328K, used 8685K [0x00000000c0000000, 0x00000000c2280000, 0x00000000eab00000)
  object space 35328K, 24% used [0x00000000c0000000,0x00000000c087b598,0x00000000c2280000)
 Metaspace       used 24829K, capacity 25096K, committed 25472K, reserved 1071104K
  class space    used 2792K, capacity 2882K, committed 2944K, reserved 1048576K
1.426: [GC (Allocation Failure) [PSYoungGen: 59392K->5394K(95744K)] 68077K->14087K(131072K), 0.0046666 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
Heap after GC invocations=7 (full 1):
 PSYoungGen      total 95744K, used 5394K [0x00000000eab00000, 0x00000000f2b00000, 0x0000000100000000)
  eden space 90112K, 0% used [0x00000000eab00000,0x00000000eab00000,0x00000000f0300000)
  from space 5632K, 95% used [0x00000000f2580000,0x00000000f2ac48b0,0x00000000f2b00000)
  to   space 7168K, 0% used [0x00000000f1d00000,0x00000000f1d00000,0x00000000f2400000)
 ParOldGen       total 35328K, used 8693K [0x00000000c0000000, 0x00000000c2280000, 0x00000000eab00000)
  object space 35328K, 24% used [0x00000000c0000000,0x00000000c087d598,0x00000000c2280000)
 Metaspace       used 24829K, capacity 25096K, committed 25472K, reserved 1071104K
  class space    used 2792K, capacity 2882K, committed 2944K, reserved 1048576K
}
Heap
 PSYoungGen      total 95744K, used 67268K [0x00000000eab00000, 0x00000000f2b00000, 0x0000000100000000)
  eden space 90112K, 68% used [0x00000000eab00000,0x00000000ee76ca98,0x00000000f0300000)
  from space 5632K, 95% used [0x00000000f2580000,0x00000000f2ac48b0,0x00000000f2b00000)
  to   space 7168K, 0% used [0x00000000f1d00000,0x00000000f1d00000,0x00000000f2400000)
 ParOldGen       total 560128K, used 532981K [0x00000000c0000000, 0x00000000e2300000, 0x00000000eab00000)
  object space 560128K, 95% used [0x00000000c0000000,0x00000000e087d5a8,0x00000000e2300000)
 Metaspace       used 29087K, capacity 29334K, committed 29824K, reserved 1075200K
  class space    used 3248K, capacity 3333K, committed 3456K, reserved 1048576K
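A minimal reproduction of my own (plain Java, not Hadoop): a single 512 MB array can never fit in a 512 MB heap, because the heap also holds everything else and is split across generations.

// Run with: java -Xmx512m BigArrayOom
public class BigArrayOom {
    public static void main(String[] args) {
        // Mirrors kvbuffer = new byte[maxMemUsage] with io.sort.mb = 512
        byte[] kvbuffer = new byte[512 << 20];  // 536870912 bytes
        System.out.println(kvbuffer.length);    // never reached: OutOfMemoryError: Java heap space
    }
}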

5 The Faulty Configuration

  • etc/hadoop/mapred-site.xml
    • A value of 512 makes each map task allocate a 512 MB array for sorting and merging on the map side (the fix is sketched after the config below)
    • Let's see how the source code does it
  <!-- Sort buffer size in MB. Upstream default: 100 -->
  <!-- Note: the map-phase sort occupies this much heap memory -->
  <property>
        <name>mapreduce.task.io.sort.mb</name>
        <value>512</value>
  </property>
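The straightforward fix, keeping the 512 MB map heap, is to shrink the sort buffer so it fits inside Xmx with room left for the map function's own objects; smaller values just mean more spills. The value below is my own illustrative choice, not a recommendation from the original:

  <property>
        <name>mapreduce.task.io.sort.mb</name>
        <value>128</value>
  </property>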

6 Source Code Analysis

  • MRJobConfig.IO_SORT_MB
    • The configuration key and its default value (a per-job override is sketched after the snippet below)
public interface MRJobConfig {
  String IO_SORT_MB = "mapreduce.task.io.sort.mb";
  public static final int DEFAULT_IO_SORT_MB = 100;
}
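Since the key is exposed as a constant, the sort buffer can also be overridden per job instead of cluster-wide. A minimal sketch of my own (class and job name are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.MRJobConfig;

public class SortMbDemo {
    public static void main(String[] args) throws IOException {
        // Override the sort buffer for this job only
        Job job = Job.getInstance(new Configuration(), "sort-mb-demo");
        job.getConfiguration().setInt(MRJobConfig.IO_SORT_MB, 128); // fits a 512M map heap
        // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true)
    }
}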
  • MapTask
    • This looks like the part that sets up the ring buffer (the circular sort buffer)
// Each meta record takes 16 bytes: NMETA (4) ints of 4 bytes each
private static final int NMETA = 4;            // num meta ints
private static final int METASIZE = NMETA * 4; // size in bytes


// Read mapreduce.task.io.sort.mb; the default is 100 (MB)
final int sortmb = job.getInt(MRJobConfig.IO_SORT_MB,
          MRJobConfig.DEFAULT_IO_SORT_MB);

// Shift left by 20: 100 << 20 = 104857600, the same as 100*1024*1024,
// i.e. convert megabytes to bytes
int maxMemUsage = sortmb << 20;
// Round down to a multiple of METASIZE (100M - (100M % 16) = 100M - 0).
// Since sortmb << 20 is always a multiple of 16, the IDE even warns:
// "Result of 'maxMemUsage % METASIZE' is always '0'"
maxMemUsage -= maxMemUsage % METASIZE;
// This array is the cause of the OOM: a single byte[] of sortmb megabytes
kvbuffer = new byte[maxMemUsage];

// The spill threshold: by default spilling starts when the ring buffer is 80% full
final float spillper = job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8);
softLimit = (int)(kvbuffer.length * spillper);
  • Application log
mapreduce.task.io.sort.mb: 512
soft limit at 429496736
bufstart = 0; bufvoid = 536870912
kvstart = 134217724; length = 33554432
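All the logged numbers can be reproduced from the MapTask code above. A back-of-envelope check of my own:

public class SortBufferMath {
    public static void main(String[] args) {
        final int NMETA = 4;                    // meta ints per record, from MapTask
        final int METASIZE = NMETA * 4;         // 16 bytes per meta record
        int sortmb = 512;

        int maxMemUsage = sortmb << 20;         // 536870912 bytes
        maxMemUsage -= maxMemUsage % METASIZE;  // no-op: already a multiple of 16
        System.out.println("bufvoid    = " + maxMemUsage);                  // 536870912

        // Float math: 0.8f is slightly greater than 0.8, hence 429496736, not 429496729
        int softLimit = (int) (maxMemUsage * 0.8f);
        System.out.println("soft limit = " + softLimit);                    // 429496736

        // kvmeta is an int view of kvbuffer; the initial kvindex/kvstart sits
        // one meta record below the end of the buffer, counted in ints
        System.out.println("kvstart    = " + (maxMemUsage - METASIZE) / 4); // 134217724
        // Maximum number of meta records the buffer can hold
        System.out.println("length     = " + (maxMemUsage / 4) / NMETA);    // 33554432
    }
}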