Have you ever had a Go program simply vanish in production? It is a baffling problem that most people never run into. Unfortunately I did, and at first all I did was restart the process (even though a fatal error like this is never a good sign) without looking into the cause. After it happened three times, I finally resolved to track down the reason.

The program was compiled with Go 1.5 and ran on a CentOS release 6.3 (Final) machine with kernel 2.6.32-279.el6.x86_64. Every crash reported the following error:

fatal error: bad map state


goroutine 89 [running]:
runtime.throw(0x990ca0, 0xd)
        /usr/local/go/src/runtime/panic.go:527 +0x90 fp=0xc8323e9bb0 sp=0xc8323e9b98
runtime.evacuate(0x803440, 0xc8200f4c30, 0x6b)
        /usr/local/go/src/runtime/hashmap.go:825 +0x3b1 fp=0xc8323e9c70 sp=0xc8323e9bb0
runtime.growWork(0x803440, 0xc8200f4c30, 0xa5)
        /usr/local/go/src/runtime/hashmap.go:795 +0x83 fp=0xc8323e9c90 sp=0xc8323e9c70
runtime.mapassign1(0x803440, 0xc8200f4c30, 0xc8323e9d60, 0xc8323e9d70)
        /usr/local/go/src/runtime/hashmap.go:433 +0x176 fp=0xc8323e9d38 sp=0xc8323e9c90

Given this error, how do we locate the problem? Start with the code around hashmap.go:825.

for ; b != nil; b = b.overflow(t) {
      k := add(unsafe.Pointer(b), dataOffset)
      v := add(k, bucketCnt*uintptr(t.keysize))
      for i := 0; i < bucketCnt; i, k, v = i+1, add(k, uintptr(t.keysize)), add(v, uintptr(t.valuesize)) {
        top := b.tophash[i]
        if top == empty {
          b.tophash[i] = evacuatedEmpty
          continue
        }
        if top < minTopHash {
          throw("bad map state")   // <----- line 825
        }
        ...
      }
      ...
}
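To make the check at line 825 easier to read, here is a minimal sketch of the decision it makes for each cell. The marker values are an assumption: they are what I recall from runtime/hashmap.go in Go 1.5 and should be checked against the source. The function checkCell is hypothetical and is not part of the runtime.

```go
package main

import "fmt"

// Tophash marker values, assumed to match runtime/hashmap.go in Go 1.5.
const (
	empty          = 0 // cell is empty
	evacuatedEmpty = 1 // cell was empty when its bucket was evacuated
	evacuatedX     = 2 // entry moved to the first half of the grown table
	evacuatedY     = 3 // entry moved to the second half of the grown table
	minTopHash     = 4 // smallest tophash of a normally filled cell
)

// checkCell mirrors the branch evacuate takes for one cell of a bucket that
// has not been evacuated yet. It is a hypothetical helper for illustration.
func checkCell(top uint8) string {
	switch {
	case top == empty:
		return "mark evacuatedEmpty"
	case top < minTopHash:
		// An evacuated marker in a bucket that is only now being
		// evacuated means the bucket contents are not what the
		// runtime wrote there.
		return "bad map state"
	default:
		return "copy entry to new bucket"
	}
}

func main() {
	for _, top := range []uint8{0, 2, 4, 200} {
		fmt.Printf("tophash %3d -> %s\n", top, checkCell(top))
	}
}
```

Under these assumed values, the fatal error fires only when a cell carries one of the three evacuated markers, which suggests the bucket memory was corrupted or reused.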

The cause is the placement of the map bucket's overflow pointer on NaCl/amd64p32. But why does the pointer end up in the wrong place?

On most systems, a pointer has the strictest (worst-case) alignment, so placing a pointer field at the end of a struct guarantees that no padding is added after it. (Such padding would otherwise be needed when some other field is more strictly aligned and the whole struct has to be rounded up to that alignment.)

At run time, the map implementation needs a fast way to reach the overflow pointer, which is the last field of the bucket struct, so it uses size - sizeof(pointer) as the offset.
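The "size minus pointer size" trick can be checked with a hand-written stand-in for a bucket. This is only a sketch: the bucket type below is an assumption modeled on map[int64]int64, while the real bucket type is generated by the compiler and is internal to the runtime.

```go
package main

import (
	"fmt"
	"unsafe"
)

const bucketCnt = 8

// bucket is a hypothetical stand-in for the compiler-generated bucket of a
// map[int64]int64: tophash bytes, then keys, then values, then the pointer.
type bucket struct {
	tophash  [bucketCnt]uint8
	keys     [bucketCnt]int64
	values   [bucketCnt]int64
	overflow *bucket
}

// declaredOffset is where the struct definition places the overflow field.
func declaredOffset() uintptr {
	var b bucket
	return unsafe.Offsetof(b.overflow)
}

// overflowOffset locates the field the way the text describes:
// total size minus the size of one pointer.
func overflowOffset() uintptr {
	return unsafe.Sizeof(bucket{}) - unsafe.Sizeof(uintptr(0))
}

func main() {
	fmt.Println("declared offset:", declaredOffset())
	fmt.Println("size - ptrsize: ", overflowOffset())
}
```

The two values agree whenever the pointer is the most strictly aligned field, which is exactly the property that does not hold on amd64p32.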

NaCl/amd64p32 is the exception, as always: the worst-case alignment is 64 bits, but pointers are 32 bits. There is a long history here that is not worth delving into, but when the overflow pointer was moved to the end of the struct, the padding was not computed correctly. The compiler computed the regular struct size and then, on amd64p32, added another 32-bit field. The runtime assumed it could step back two 32-bit fields (one 64-bit register size) from the end to reach the overflow pointer.

But in practice, if the struct requires 64-bit alignment, the regular size computation has already added one 32-bit pad, and the code then unconditionally adds a second one. This places the overflow pointer three words from the end of the struct (a word here is 32 bits, i.e. 4 bytes), not two. The last two words are padding, and because the runtime consistently uses the second-to-last word as the overflow pointer, no useful memory is overwritten. But writing the overflow pointer into a word the GC treats as a non-pointer means the GC cannot see the overflow buckets, so it collects them prematurely, and that leads to the error.
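The arithmetic in the last two paragraphs can be modeled in a few lines. This is a simplified sketch of the description above, not the compiler's actual code: layout, buggyLayout and the constants are hypothetical names, and dataSize stands for the tophash bytes plus the key and value arrays.

```go
package main

import "fmt"

const (
	ptrSize = 4 // 32-bit pointers, as on amd64p32
	regSize = 8 // 64-bit registers
)

// roundUp rounds n up to a multiple of align (align must be a power of two).
func roundUp(n, align int) int {
	return (n + align - 1) &^ (align - 1)
}

// layout records where each side believes the overflow pointer lives.
type layout struct {
	compilerOff int // offset of the field in the compiler's type information
	runtimeOff  int // offset the runtime reads and writes
	size        int // total bucket size
}

// buggyLayout models the behavior described in the text: regular struct
// sizing, then one more 32-bit word added unconditionally, with the runtime
// stepping back one register size from the end. align is the struct's
// overall alignment and is expected to be at least ptrSize.
func buggyLayout(dataSize, align int) layout {
	off := roundUp(dataSize, ptrSize)
	size := roundUp(off+ptrSize, align) + (regSize - ptrSize)
	return layout{compilerOff: off, runtimeOff: size - regSize, size: size}
}

func main() {
	// 8 tophash bytes + 8 keys + 8 values.
	small := buggyLayout(8+8*4+8*4, 4) // 4-byte keys and values
	large := buggyLayout(8+8*8+8*8, 8) // 8-byte keys and values
	fmt.Printf("4-byte entries: %+v\n", small)
	fmt.Printf("8-byte entries: %+v\n", large)
}
```

In this model the two offsets agree for 4-byte keys and values (both 72), but for 8-byte keys and values the declared field sits at offset 136 while the runtime uses offset 140, a word the GC does not treat as a pointer.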

In short, an overflow pointer stored in a non-pointer word of memory may be invisible to the GC, so the overflow bucket is freed while still in use and the program crashes (so far this has been found on NaCl/amd64p32).

The solution

This article offers three remedies:

  • Upgrade the Go version used for the build from 1.5 to 1.8 (required)

  • Add process monitoring

  • Run the process under Supervisor

How Go fixed it

The upgrade is unavoidable, but how did the Go team fix this in later releases?

  • Add an explicit check at the end of the compiler's bucket layout that the overflow field is the last field in the struct and is never followed by padding.

  • When padding is needed on NaCl (and only when it is needed), insert it before the overflow pointer to preserve the "last in the struct" property.

  • Let the compiler have the final say on the width of the struct by inserting an explicit padding field, rather than overwriting the result of its width computation.

  • For the same reason (the compiler must be told the truth), set the type of the overflow field explicitly in the case where it is deliberately treated as a non-pointer (in that case, the runtime maintains a list of the overflow buckets elsewhere).

  • Make the runtime use "last in the struct" as its rule for locating the overflow pointer.
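The layout rule that these steps establish can be sketched as follows. Again this is a simplified model under stated assumptions, not the compiler's code: fixedLayout and its fields are hypothetical names, pointers are assumed to be 32 bits wide, and dataSize stands for the tophash bytes plus the key and value arrays.

```go
package main

import "fmt"

const ptrSize = 4 // 32-bit pointers, as on amd64p32

// roundUp rounds n up to a multiple of align (align must be a power of two).
func roundUp(n, align int) int {
	return (n + align - 1) &^ (align - 1)
}

// fixedBucket describes a bucket whose overflow pointer is always last.
type fixedBucket struct {
	pad         int // explicit padding inserted before the overflow pointer
	overflowOff int // offset of the overflow pointer
	size        int // total bucket size
}

// fixedLayout places any required padding before the overflow pointer, so
// the pointer is the last word and size-ptrSize always locates it.
func fixedLayout(dataSize, align int) fixedBucket {
	if align < ptrSize {
		align = ptrSize // the pointer itself needs pointer alignment
	}
	size := roundUp(dataSize+ptrSize, align)
	off := size - ptrSize
	return fixedBucket{pad: off - dataSize, overflowOff: off, size: size}
}

func main() {
	fmt.Printf("4-byte entries: %+v\n", fixedLayout(8+8*4+8*4, 4))
	fmt.Printf("8-byte entries: %+v\n", fixedLayout(8+8*8+8*8, 8))
}
```

With 8-byte keys and values the model inserts 4 bytes of padding at offset 136 and puts the pointer at offset 140, the final word of a 144-byte bucket, so the compiler's type information and the runtime's offset agree.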
