I went to Gopher China this time and caught up with many old friends, as well as online friends I had long known on WeChat but never met in person. I also talked with front-line developers from various companies about their experiences and pain points with Go.

In this post I've put together some of the less common optimizations and hacks I've seen so far in the Go world. Because some of these companies are in a special position, the points listed here are not attributed to any company. When the time is right, they will come out and share the details themselves.


Go's current network abstraction is somewhat inefficient: each connection needs at least one goroutine, and some protocol implementations need two. The total number of goroutines is therefore 1x or 2x the number of connections. Once connections exceed 100,000, the goroutine stacks alone consume several gigabytes of memory.
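To get a feel for the per-goroutine cost, here is a minimal sketch that parks a batch of goroutines on a channel (standing in for goroutines blocked on a connection read) and reports how much stack memory the runtime says is in use. The function name `parkedStackGrowth` is made up for illustration, and the numbers are a lower bound: a goroutine stack starts at around 2 KB in current Go releases and grows once real protocol handling runs on it.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parkedStackGrowth starts n goroutines that block on a channel and
// returns how much runtime.MemStats.StackInuse grew while they were parked.
func parkedStackGrowth(n int) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	stop := make(chan struct{})
	var ready, done sync.WaitGroup
	ready.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			ready.Done()
			<-stop // parked, much like a goroutine waiting on conn.Read
		}()
	}
	ready.Wait() // every goroutine has started and owns a stack

	runtime.ReadMemStats(&after)
	close(stop)
	done.Wait()

	if after.StackInuse < before.StackInuse {
		return 0
	}
	return after.StackInuse - before.StackInuse
}

func main() {
	const n = 10000
	grew := parkedStackGrowth(n)
	fmt.Printf("%d parked goroutines: about %d KiB of stack, about %d bytes each\n",
		n, grew/1024, grew/n)
}
```

Multiplying the per-goroutine figure by one or two goroutines per connection gives a rough floor for the stack memory of a long-connection service.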

A large number of goroutines also puts a lot of pressure on both scheduling and GC. Go's low-level networking and cryptography libraries are not very efficient either, so in comparable applications there is a big performance gap (memory, throughput) between Go and languages such as C++.

Quite a few community libraries optimize this by calling epoll directly, so I won't advertise them here. However, because the user's syscall.EpollWait runs in an ordinary goroutine without any priority, when the CPU has little idle time the overall latency of the system becomes uncontrollable and ends up much higher than with the standard library. In tests I ran earlier, latency also rose significantly as the number of cores increased.
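For readers who have not seen the raw-epoll style, the following is a minimal, Linux-only sketch using the standard `syscall` package. It is not taken from any of the community libraries; it registers the read end of a pipe with epoll and waits for it to become readable. The helper names `waitReadable` and `pollOnce` are hypothetical. In a real library the `EpollWait` loop would run in its own goroutine, which the Go scheduler treats like any other goroutine, and that is the source of the latency problem described above.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// waitReadable blocks in epoll_wait until some registered fd is readable.
// It returns -1 if the timeout expires first.
func waitReadable(epfd, timeoutMs int) (int, error) {
	events := make([]syscall.EpollEvent, 8)
	for {
		n, err := syscall.EpollWait(epfd, events, timeoutMs)
		if err == syscall.EINTR {
			continue // interrupted by a signal, retry
		}
		if err != nil {
			return -1, err
		}
		if n == 0 {
			return -1, nil // timed out
		}
		return int(events[0].Fd), nil
	}
}

// pollOnce writes msg (at most 64 bytes) into a pipe and reads it back
// after epoll reports the read end as readable.
func pollOnce(msg string) (string, error) {
	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		return "", err
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		return "", err
	}
	defer syscall.Close(epfd)

	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(p[0])}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, p[0], &ev); err != nil {
		return "", err
	}
	if _, err := syscall.Write(p[1], []byte(msg)); err != nil {
		return "", err
	}

	fd, err := waitReadable(epfd, 1000)
	if err != nil {
		return "", err
	}
	if fd < 0 {
		return "", errors.New("timed out waiting for readable fd")
	}
	buf := make([]byte, 64)
	n, err := syscall.Read(fd, buf)
	if err != nil {
		return "", err
	}
	return string(buf[:n]), nil
}

func main() {
	got, err := pollOnce("ping")
	fmt.Println(got, err)
}
```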

The optimization ideas at different companies then go in two directions:

  1. Modify the runtime so that it calls back into the user's epoll function, similar to the netpoll that the runtime implements for itself. This makes Go hard to upgrade, because you must keep maintaining a hacked version, and it can be awkward when you run into Go bugs.
  2. Treat a network library written in C as the foundation, and graft the business logic written in Go onto that C library.

It's a bit sad that Go, a language that is supposed to be great for network programming, now forces people to hack the runtime for optimization, or even to graft onto other languages. Hopefully there will be better infrastructure in the future to support this kind of high-connection-count scenario.


Because C has accumulated so much over its long history, using CGO is unavoidable, and I see a lot of it. For example, the ICU library required for internationalization can only be called through CGO. The same goes for OpenCV in computer vision work, and for some scenarios that require the Chinese national cryptographic algorithms.

But calling from Go to C involves a stack switch, which is somewhat expensive. So some companies have implemented a hack that calls from Go to C without a stack switch, while marking the stack so that the GC scans only the Go stack and not the C stack.


The official Go plugin mechanism is still hard to use: for example, the compile versions must match, and a plugin cannot be unloaded once it is loaded.

Some companies have now built dynamic libraries on top of the .got table that support hot loading and hot unloading and are more flexible than the official plugin.

(I don't know much about this and am only relaying what I heard. If anything is wrong, please point it out.)

Assembly optimization

The Go compiler backend is implemented by the Go team itself, without relying on an established platform such as LLVM.

Some people write the equivalent code in C, compile it at a high optimization level such as clang -O3 to get highly optimized assembly, translate that into Plan 9 assembly, and wrap it in functions for the Go application to call. In effect, this lets Go enjoy the backend optimizations of the LLVM platform.

Runtime changes

Besides the network programming case mentioned earlier, where EpollWait needs to run in a high-priority goroutine, other applications that involve task distribution and task processing need similar high-priority goroutines.

So some companies expose an interface directly in the runtime that lets users create goroutines with a special priority.

Beyond exposing the interface, the runtime implementation itself also needs a number of changes.
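The runtime patches themselves are not public, so they cannot be shown here. As a rough point of comparison, this is a user-space sketch of what people usually do without touching the runtime: a dispatcher that always checks a high-priority queue before a low-priority one. The names `next` and `drain` are hypothetical. Note the limitation: this only orders the tasks a worker picks up; it does nothing to make the scheduler run the worker goroutine itself sooner, which is exactly the gap the runtime-level approach is meant to close.

```go
package main

import "fmt"

// next returns the next task, preferring the high-priority queue.
// It blocks until a task arrives or both channels are closed and empty.
func next(high, low <-chan string) (string, bool) {
	// Non-blocking look at the high-priority queue first.
	select {
	case t, ok := <-high:
		if ok {
			return t, true
		}
		high = nil // closed and empty: a nil channel is never selected
	default:
	}
	// Nothing urgent was queued: wait on whichever queue delivers first.
	// If both become ready at the same instant, select picks one at random.
	for high != nil || low != nil {
		select {
		case t, ok := <-high:
			if !ok {
				high = nil
				continue
			}
			return t, true
		case t, ok := <-low:
			if !ok {
				low = nil
				continue
			}
			return t, true
		}
	}
	return "", false
}

// drain runs next until both queues are exhausted and records the order.
func drain(high, low <-chan string) []string {
	var order []string
	for {
		t, ok := next(high, low)
		if !ok {
			return order
		}
		order = append(order, t)
	}
}

func main() {
	high := make(chan string, 2)
	low := make(chan string, 2)
	low <- "l1"
	low <- "l2"
	high <- "h1"
	high <- "h2"
	close(high)
	close(low)
	fmt.Println(drain(high, low))
}
```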

SSA-based static checks

As we know, the linters collected in golangci-lint mostly use Go's built-in compiler front end and run some logic over the parsed AST to alert users to possible problems in their code.
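To make the AST-level approach concrete, here is a small sketch using the standard `go/parser` and `go/ast` packages. It is not taken from any real linter, and the function name `countBuiltinAllocs` is made up. It counts syntactic calls to `new` and `make`. It also shows the limit of working purely on syntax: the sample's `&T{...}` literal may well end up on the heap, but there is no call in the AST to catch.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// sample is the code under inspection. Function c contains no new/make call,
// yet its composite literal escapes through the return value.
const sample = `package p

type T struct{ x int }

func a() *int    { return new(int) }
func b() []byte  { return make([]byte, 16) }
func c() *T      { return &T{x: 1} }
`

// countBuiltinAllocs parses src and counts calls whose callee is the bare
// identifier new or make. It does no type checking, so a user-defined
// function with one of those names would be counted as well.
func countBuiltinAllocs(src string) (int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "sample.go", src, 0)
	if err != nil {
		return 0, err
	}
	count := 0
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		if id, ok := call.Fun.(*ast.Ident); ok && (id.Name == "new" || id.Name == "make") {
			count++
		}
		return true
	})
	return count, nil
}

func main() {
	n, err := countBuiltinAllocs(sample)
	fmt.Println(n, err)
}
```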

In some lower-level programming scenarios, developers want to eliminate all heap allocations, so they check whether the generated SSA contains a call to newobject, the runtime's heap allocation routine, and use the result to guide optimization.
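The SSA checkers described here are in-house tools and I can't show them. A different but related technique that anyone can use is to measure allocations at run time with `testing.AllocsPerRun` from the standard library. The sketch below is that substitute, not the SSA check itself, and `escapes`, `staysOnStack`, and `allocsPerCall` are illustrative names. For a static view, `go build -gcflags=-m` prints the compiler's escape analysis decisions, and setting the `GOSSAFUNC` environment variable to a function name makes the compiler dump that function's SSA.

```go
package main

import (
	"fmt"
	"testing"
)

var (
	sink  *[64]byte // package-level pointer forces the allocation to escape
	total int
)

// escapes stores a freshly allocated array in a global, so it must live on the heap.
func escapes() {
	sink = new([64]byte)
}

// staysOnStack only uses a local array, so no heap allocation is needed.
func staysOnStack() {
	var buf [64]byte
	for i := range buf {
		buf[i] = byte(i)
	}
	n := 0
	for _, b := range buf {
		n += int(b)
	}
	total = n
}

// allocsPerCall reports the average number of heap allocations per call of f.
func allocsPerCall(f func()) float64 {
	return testing.AllocsPerRun(100, f)
}

func main() {
	fmt.Println("escapes:", allocsPerCall(escapes))
	fmt.Println("staysOnStack:", allocsPerCall(staysOnStack))
}
```

A run-time measurement like this only covers the paths that actually execute, which is one reason a static check over SSA is attractive for code that must never allocate.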

Garbage collection

Go's garbage collector is not generational. Generational collection involves moving objects between generations, and judging from what the Go team has shared about its history, they are firmly committed to a non-moving design. If objects on the heap can move, a "read barrier" has to be enabled when reading objects. Almost all workloads read far more than they write, so this would hurt program performance considerably. As a result, official work on generational GC has stalled.

A company in China has implemented generational garbage collection on top of the earlier official generational GC CL, but because we can't see the code, it's not clear whether it will ultimately deliver a good performance improvement in production.


This article is only a brief introduction without many details. If readers are interested, I will expand on each point in future posts ~