Reposted from CSDN: blog.csdn.net/erlib/artic…

Original text: bravenewgeek.com/so-you-wann…

This article is about the Go language, and primarily about speed — speed of execution, not speed of development. There is a difference between the two.

I’ve worked with a lot of smart people. Many of us are obsessed with performance, and what we’ve been doing is pushing toward the limits of what can be expected. Our application engine has some very stringent performance requirements, which is why we made the change. Since adopting Go, we’ve learned a lot about how to improve performance and make Go work for systems programming.

Go’s simplicity and native concurrency make it a very attractive back-end language, but the bigger question is how it handles latency-sensitive scenarios. Is it worth sacrificing the simplicity of the language to make it faster? Let’s look at several aspects of Go performance optimization — language features, memory management, and concurrency — and make appropriate optimization decisions based on them. All of the benchmark code described here is available from the original article.

Channels

Channels get a lot of attention in Go because they are a convenient concurrency primitive, but it’s also important to understand their performance impact. Their performance is “good enough” in most scenarios, but they can become a bottleneck in latency-sensitive ones. Channels are not dark magic: under the hood, the implementation takes a lock. They work fine in single-threaded applications with no lock contention, but performance degrades dramatically under multi-threaded contention. We can easily replace a channel with a lock-free ring buffer.

The first benchmark compares a single-threaded buffered channel with a ring buffer (one producer and one consumer). First, the single-core case (GOMAXPROCS = 1):

BenchmarkChannel       3000000    512 ns/op
BenchmarkRingBuffer   20000000   80.9 ns/op

As you can see, the ring buffer is about six times faster. (If you’re not familiar with Go’s benchmarking tool: the middle number is the number of iterations run, and the last number is the time per operation.) Now let’s set GOMAXPROCS = 8.

BenchmarkChannel-8 3000000 542 ns/op
BenchmarkRingBuffer-8 10000000 182 ns/op

The ring buffer is nearly three times faster.
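For reference, this is roughly how such a benchmark is written with Go’s testing package. The code below is an illustrative sketch, not the original test code — the names and the buffer size are assumptions. Running it with go test -bench=. -cpu=1,8 covers both the GOMAXPROCS=1 and GOMAXPROCS=8 cases.

package bench

import "testing"

// A sketch of a single-producer, single-consumer benchmark over a
// buffered channel; absolute numbers will vary by machine.
func BenchmarkChannel(b *testing.B) {
    ch := make(chan int, 1024) // buffer size is an arbitrary choice for this sketch
    go func() {
        for i := 0; i < b.N; i++ {
            ch <- i
        }
    }()
    for i := 0; i < b.N; i++ {
        <-ch
    }
}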

Channels are usually used to distribute tasks to workers. In the next test, we compare multiple readers reading from the same channel or ring buffer. The GOMAXPROCS = 1 results show that the channel performs particularly well in single-core applications.

BenchmarkChannelReadContention 10000000 148 ns/op
BenchmarkRingBufferReadContention 10000 390195 ns/op

However, ring buffers are faster with multiple cores:

BenchmarkChannelReadContention-8 1000000 3105 ns/op
BenchmarkRingBufferReadContention-8 3000000 411 ns/op

Finally, let’s look at multiple readers and multiple writers. The comparison below shows that the ring buffer again does better with multiple cores.

BenchmarkChannelContention 10000 160892 ns/op
BenchmarkRingBufferContention 28068 34344 ns/op
BenchmarkChannelContention-8 5000 314428 ns/op
BenchmarkRingBufferContention-8 10000 182557 ns/op

The ring buffer achieves thread safety using only CAS operations. As we can see, the choice between a channel and a ring buffer depends largely on the number of cores available to the program. For most systems GOMAXPROCS > 1, so a lock-free ring buffer is often the better choice. On multi-core systems, the channel performs poorly under contention.
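To make the comparison concrete, below is a minimal sketch of a single-producer/single-consumer lock-free ring buffer built on sync/atomic. It is not the queue used in the benchmarks above — the multi-producer/multi-consumer version there additionally relies on CAS and per-slot sequencing — but it shows the basic idea: the head and tail indices advance with atomic loads and stores, and no mutex is ever taken.

package ringbuf

import (
    "errors"
    "sync/atomic"
)

var (
    ErrFull  = errors.New("ring buffer is full")
    ErrEmpty = errors.New("ring buffer is empty")
)

// SPSCRing is a sketch of a single-producer/single-consumer lock-free
// ring buffer. The capacity must be a power of two so that index & mask
// wraps around cheaply.
type SPSCRing struct {
    head  uint64 // next slot to read; advanced only by the consumer
    tail  uint64 // next slot to write; advanced only by the producer
    mask  uint64
    items []interface{}
}

func NewSPSCRing(capacity uint64) *SPSCRing {
    // capacity is assumed to be a power of two in this sketch.
    return &SPSCRing{mask: capacity - 1, items: make([]interface{}, capacity)}
}

// Put is called by the single producer.
func (r *SPSCRing) Put(v interface{}) error {
    tail := atomic.LoadUint64(&r.tail)
    head := atomic.LoadUint64(&r.head)
    if tail-head == uint64(len(r.items)) {
        return ErrFull
    }
    r.items[tail&r.mask] = v
    atomic.StoreUint64(&r.tail, tail+1) // publish the write to the consumer
    return nil
}

// Get is called by the single consumer.
func (r *SPSCRing) Get() (interface{}, error) {
    head := atomic.LoadUint64(&r.head)
    tail := atomic.LoadUint64(&r.tail)
    if head == tail {
        return nil, ErrEmpty
    }
    v := r.items[head&r.mask]
    atomic.StoreUint64(&r.head, head+1) // release the slot back to the producer
    return v, nil
}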

defer

Defer is a very useful keyword for improving readability and avoiding resource leaks. For example, when we open a file for reading, we need to close it when we finish. Without the defer keyword, we must ensure that the file is closed before each return point of the function.

func findHelloWorld(filename string) error {
    file, err := os.Open(filename)
    if err != nil {
        return err
    }

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        if scanner.Text() == "hello, world!" {
            file.Close()
            return nil
        }
    }

    file.Close()

    if err := scanner.Err(); err != nil {
        return err
    }

    return errors.New("Didn't find hello world")
}

This is error-prone, because it is easy to forget to close the file before one of the return statements. Defer solves this problem with a single line of code.

func findHelloWorld(filename string) error {
    file, err := os.Open(filename)
    if err != nil {
        return err
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        if scanner.Text() == "hello, world!" {
            return nil
        }
    }

    if err := scanner.Err(); err != nil {
        return err
    }

    return errors.New("Didn't find hello world")
}

At first glance, one might expect defer to be optimized away entirely by the compiler. If defer were only ever used at the beginning of a function, the compiler could indeed do this by inserting the deferred call before each return statement. In reality it’s more complicated: defer can appear inside a conditional branch or a loop, in which case the compiler has to work out exactly which branches the deferred call applies to. The compiler also has to account for panics, since they are another way a function can exit. Handling all of this purely through static compilation seems (at least on the surface) unlikely.

Defer is not a zero-cost keyword, as the benchmarks show. In the following test, we compare unlocking a mutex directly in the loop body with unlocking it via defer.

BenchmarkMutexDeferUnlock-8    20000000    96.6 ns/op
BenchmarkMutexUnlock-8        100000000    19.5 ns/op

Using defer is almost five times slower. To be fair, a difference of 77 ns may not matter much, but it adds up inside a hot loop. It’s usually up to the developer to trade off performance against readability. Optimization always has a cost. (In Go 1.9, defer costs roughly 50–60 ns/op.)
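The benchmark behind these numbers has roughly the following shape — an illustrative sketch with assumed names, not the original code:

package bench

import (
    "sync"
    "testing"
)

var mu sync.Mutex

// Pays the cost of a defer on every iteration. (The closure adds a
// little overhead of its own, but defer dominates.)
func BenchmarkMutexDeferUnlock(b *testing.B) {
    for i := 0; i < b.N; i++ {
        func() {
            mu.Lock()
            defer mu.Unlock()
        }()
    }
}

// Unlocks directly, avoiding defer.
func BenchmarkMutexUnlock(b *testing.B) {
    for i := 0; i < b.N; i++ {
        mu.Lock()
        mu.Unlock()
    }
}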

JSON and reflection

Reflection is generally slow and should be avoided in latency-sensitive services. JSON is a common data interchange format, but Go’s encoding/json package relies on reflection to serialize and deserialize. With ffjson, we can avoid reflection through code generation, as sketched below; the benchmark comparison follows.
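To make the distinction concrete, the sketch below shows an example type and the kind of type-specific MarshalJSON method a generator such as ffjson emits for it. The hand-written method here is a simplified stand-in for generated code, not ffjson’s actual output; once a type has such a method, json.Marshal dispatches to it instead of reflecting over the fields.

package jsonx

import "fmt"

// User is an illustrative type. With plain encoding/json, Marshal and
// Unmarshal walk its fields via reflection at runtime.
type User struct {
    ID   int64  `json:"id"`
    Name string `json:"name"`
}

// MarshalJSON is a simplified stand-in for generated code: it writes the
// fields directly instead of reflecting. (The %q escaping here is naive
// and for illustration only.)
func (u User) MarshalJSON() ([]byte, error) {
    return []byte(fmt.Sprintf(`{"id":%d,"name":%q}`, u.ID, u.Name)), nil
}

With that in mind, here is the performance comparison between the reflection-based standard library and the generated code: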

BenchmarkJSONReflectionMarshal-8 200000 7063 ns/op
BenchmarkJSONMarshal-8 500000 3981 ns/op

BenchmarkJSONReflectionUnmarshal-8 200000 9362 ns/op
BenchmarkJSONUnmarshal-8 300000 5839 ns/op

The serialization and deserialization code generated by ffjson is about 38% faster than the reflection-based standard library. Of course, if we really care about codec performance, we should avoid JSON altogether. MessagePack is a better choice for serialization. In this test we used the msgp library to compare against JSON.

BenchmarkMsgpackMarshal-8              3000000     555 ns/op
BenchmarkJSONReflectionMarshal-8        200000    7063 ns/op
BenchmarkJSONMarshal-8                  500000    3981 ns/op

BenchmarkMsgpackUnmarshal-8           20000000    94.6 ns/op
BenchmarkJSONReflectionUnmarshal-8      200000    9362 ns/op
BenchmarkJSONUnmarshal-8                300000    5839 ns/op

The difference here is significant. MessagePack is still much faster, even compared with ffjson’s generated code.

If we really care about micro-optimizations, we should also avoid interface types, which require extra processing during serialization and deserialization. In dynamically dispatched call scenarios, the runtime method lookup also adds overhead, and the compiler cannot inline these calls.

BenchmarkJSONReflectionUnmarshal-8 200000 9362 ns/op
BenchmarkJSONReflectionUnmarshalIface-8 200000 10099 ns/op

Let’s look at method dispatch directly. This test calls the same method on the same struct; the difference is that in the second case the call is made through an interface.
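A sketch of what such a benchmark might look like (illustrative types and names, not the original code):

package bench

import "testing"

type point struct{ x, y int64 }

func (p *point) sum() int64 { return p.x + p.y }

type summer interface {
    sum() int64
}

var sink int64 // keeps the compiler from optimizing the calls away

// Direct call on the concrete type: the compiler can inline it.
func BenchmarkStructMethodCall(b *testing.B) {
    p := &point{1, 2}
    for i := 0; i < b.N; i++ {
        sink = p.sum()
    }
}

// Call through an interface: dynamic dispatch, typically not inlined.
func BenchmarkIfaceMethodCall(b *testing.B) {
    var s summer = &point{1, 2}
    for i := 0; i < b.N; i++ {
        sink = s.sum()
    }
}

The measured results: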

BenchmarkStructMethodCall-8    2000000000    0.44 ns/op
BenchmarkIfaceMethodCall-8     1000000000    2.97 ns/op

Sorting is a more practical example that shows the performance difference. In this test, we sort 1,000,000 structs and 1,000,000 interfaces containing the same structs. Sorting the structs is about 63% faster than sorting the interfaces.

BenchmarkSortStruct-8 10 105276994 ns/op
BenchmarkSortIface-8 5 286123558 ns/op

In general, avoid JSON if possible. If you do need it, generate the serialization and deserialization code. Second, it’s best to avoid relying on reflection and interfaces and instead write code against concrete types. Unfortunately, this tends to produce a lot of duplicated code, so it’s best to generate that code. Third, weigh the costs against the benefits.

Memory management

Go does not actually expose heap or stack allocation directly to the user. In fact, the words “heap” and “stack” do not appear anywhere in the language specification, which means anything concerning the stack and heap is technically implementation-specific. In practice, of course, each goroutine has its own stack and the heap is shared, and the compiler performs escape analysis to determine whether an object is allocated on the stack or must go to the heap.

Not surprisingly, avoiding heap allocation is a major direction for optimization. By allocating on the stack (that is, creating a value with A{} rather than new(A)), we avoid expensive malloc calls. The sketch below illustrates the distinction; the benchmark results follow.
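The example is hypothetical; go build -gcflags='-m' prints the compiler’s escape-analysis decisions, so you can check where a given value actually ends up.

package alloc

// payload is an illustrative 96-byte struct.
type payload struct {
    buf [96]byte
}

// stackAlloc keeps the value local, so escape analysis lets it live on
// the goroutine's stack: no malloc, no GC pressure.
func stackAlloc() int {
    p := payload{}
    return len(p.buf)
}

// heapAlloc returns a pointer to the value, so it escapes the function
// and must be allocated on the heap.
func heapAlloc() *payload {
    return &payload{}
}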

BenchmarkAllocateHeap-8     20000000    62.3 ns/op    96 B/op    1 allocs/op
BenchmarkAllocateStack-8   100000000    11.6 ns/op     0 B/op    0 allocs/op

Naturally, passing by pointer is faster than passing by value, because the former copies a single pointer while the latter copies the entire object. The difference in the test results below is small in absolute terms, and how much it matters depends largely on the type of object being copied. Note that the compiler may also apply some optimizations to this test.
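For illustration, a sketch of the two calling conventions (the struct and its size are assumptions):

package passing

// big is an illustrative 512-byte struct.
type big struct {
    vals [64]int64
}

// byValue copies the entire 512-byte struct into the callee's frame.
func byValue(b big) int64 { return b.vals[0] }

// byPointer copies only an 8-byte pointer.
func byPointer(b *big) int64 { return b.vals[0] }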

BenchmarkPassByReference-8 1000000000 2.35 ns/op
BenchmarkPassByValue-8 200000000 6.36 ns/op

However, the biggest problem with heap allocation is garbage collection (GC). If we create a lot of short-lived objects, we trigger GC work. This is where object pools come in handy. In the following test, we compare heap allocation with sync.Pool, which improves performance roughly fivefold.

BenchmarkConcurrentStructAllocate-8     5000000     337 ns/op
BenchmarkConcurrentStructPool-8        20000000    65.5 ns/op

It should be noted that Go’s sync.Pool is itself drained during garbage collection; its purpose is to reuse memory between garbage collections. We could also maintain our own free list so that objects are never collected, but that somewhat defeats the purpose of having a garbage collector. Go’s pprof tool is very useful for analyzing memory usage; be sure to profile before blindly applying memory optimizations.
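A minimal sync.Pool sketch (the buffer type and its size are illustrative):

package pool

import "sync"

// buffer is an illustrative pooled object.
type buffer struct {
    data [4096]byte
}

var bufPool = sync.Pool{
    // New is called only when the pool has nothing to hand out.
    New: func() interface{} { return new(buffer) },
}

func process() {
    buf := bufPool.Get().(*buffer) // reuse an existing buffer if one is available
    defer bufPool.Put(buf)         // hand it back for the next caller

    // ... work with buf.data ...
    _ = buf.data
}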

Cache affinity

When performance really matters, you have to start thinking at the hardware level. The famous Formula One driver Jackie Stewart once said, “You don’t have to be an engineer to be a racing driver, but you do have to have mechanical knowledge.” Understanding the inner workings of a car can make you a better driver. Similarly, understanding how computers work can make you a better programmer. For example, how is memory laid out? How does CPU caching work? How does the hard drive work?

Memory bandwidth is a limited resource on modern CPUs, so caching is extremely important to prevent performance bottlenecks. Modern multicore processors cache data in cache lines, typically 64 bytes in size, to reduce expensive trips to main memory. To keep caches coherent, even a tiny write to memory invalidates the cache line, so subsequent reads of adjacent addresses miss the cache. This phenomenon is called false sharing, and it becomes a real problem when multiple threads access different data that happens to live in the same cache line.

Imagine how a struct in Go might be stored in memory. Using the previous ring buffer as an example, the structure might look like this:

type RingBuffer struct {  
    queue          uint64  
    dequeue        uint64  
    mask, disposed uint64  
    nodes          nodes  
}  

The queue and dequeue fields determine the positions of the producer and consumer respectively. Each field is 8 bytes and is read and modified concurrently by multiple threads as items are inserted into and removed from the queue. Since the two fields are contiguous in memory and together occupy only 16 bytes, they will most likely be stored in the same cache line. Modifying either one therefore evicts the other from cache, which means subsequent reads are slower. In other words, adding and removing elements from the ring buffer causes a lot of CPU cache invalidation.

We can add padding between the struct’s fields. Each pad is the size of a CPU cache line, which guarantees that the ring buffer’s fields end up in different cache lines. Here is the modified struct:

type RingBuffer struct {  
    _padding0      [8]uint64  
    queue          uint64  
    _padding1      [8]uint64  
    dequeue        uint64  
    _padding2      [8]uint64  
    mask, disposed uint64  
    _padding3      [8]uint64  
    nodes          nodes  
}  

How much difference does this actually make at run time? As with any optimization, it depends: on the number of CPU cores, the amount of contention, and the memory layout. Although there are many factors to consider, we should let the numbers speak for themselves. We can compare the ring buffer with and without padding.

First, we test one producer and one consumer, each running in its own goroutine. In this test the difference is small, a performance improvement of around 15%:

BenchmarkRingBufferSPSC-8 10000000 156 ns/op
BenchmarkRingBufferPaddedSPSC-8 10000000 132 ns/op

But when we have multiple producers and consumers, say 100 of each, the difference becomes more pronounced. In this case the padded version is about 36% faster.

BenchmarkRingBufferMPMC-8 100000 27763 ns/op
BenchmarkRingBufferPaddedMPMC-8 100000 17860 ns/op

False sharing is a very real problem, and adding padding mitigates the effects of concurrency and memory contention. These numbers may look trivial, but they add up, especially when you start thinking in terms of clock cycles.

Lock-free sharing

Lock-free data structures are important if you want to take full advantage of multiple cores. Given that Go is aimed at highly concurrent use cases, it discourages the use of locks and encourages channels over mutexes.

That said, the standard library does provide the usual memory-level atomic primitives in the sync/atomic package: atomic compare-and-swap, atomic pointer access, and so on. However, use of the atomic package is largely discouraged:

We generally don’t want sync/atomic to be used at all… Experience has shown us again and again that very very few people are capable of writing correct code that uses atomic operations… If we had thought of internal packages when we added the sync/atomic package, perhaps we would have used that. Now we can’t remove the package because of the Go 1 guarantee.

How hard can lock-free programming be? Surely it’s just a few CAS loops? After learning enough, I realized it is definitely a double-edged sword. Lock-free code can be quite complicated to implement, and the atomic and unsafe packages are not easy to use. Moreover, writing thread-safe lock-free code is tricky and error-prone. A simple lock-free data structure like the ring buffer is relatively easy to maintain, but other scenarios quickly become problematic.
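For a sense of what the primitives look like, here is a minimal compare-and-swap retry loop using sync/atomic — a sketch with assumed names, and exactly the kind of code the quote above suggests most programs should not be writing by hand:

package counter

import "sync/atomic"

// addCapped increments *n up to a ceiling using a CAS retry loop. The
// loop is required because another goroutine may change the value
// between our load and our swap.
func addCapped(n *uint64, ceiling uint64) uint64 {
    for {
        old := atomic.LoadUint64(n)
        if old >= ceiling {
            return old
        }
        if atomic.CompareAndSwapUint64(n, old, old+1) {
            return old + 1
        }
        // CAS failed: another goroutine won the race; reload and retry.
    }
}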

Ctrie is a lock-free data structure implementation I wrote, explained in detail here. Although the theory is easy to understand, the implementation is quite complex. The main reason is the lack of a double CAS, which would let us atomically compare nodes (to detect mutations on the tree) and also manage node snapshots. Since no hardware provides such an operation, we have to simulate it ourselves.

The first version of Ctrie was a terrible failure, not because I misused Go’s synchronization primitives, but because I made a wrong assumption about the language. Each node in a Ctrie has a generation node associated with it. When a snapshot is taken, the root node is copied to a new generation, and nodes in the tree are lazily copied to the new generation as they are accessed (much like a persistent data structure), which makes snapshots constant-time. To avoid integer overflow, we allocate objects on the heap to distinguish old generations from new ones, and in Go we used an empty struct for this. In Java, two newly allocated empty objects are distinct because their memory addresses differ, so I assumed the same held in Go. The reality was brutal, as the documentation shows:

A struct or array type has size zero if it contains no fields (or elements, respectively) that have a size greater than zero. Two distinct zero-size variables may have the same address in memory.

So the tragedy happens: two newly created generation nodes compare as equal, and the double CAS always succeeds. The bug is interesting, but chasing it in a high-concurrency, lock-free environment is hell. If you don’t get it right the first time, you will spend a lot of time hunting hidden problems — and even if it works the first time, that doesn’t mean it’s right.
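The gotcha is easy to reproduce (a minimal sketch; the output is allowed to vary between runs and compiler versions, which is precisely the problem):

package main

import "fmt"

// generation is a zero-size marker type, like the one used to
// distinguish old nodes from new ones.
type generation struct{}

func main() {
    a := &generation{}
    b := &generation{}
    // The spec allows two distinct zero-size allocations to share an
    // address, so this may print true even though a and b were
    // allocated separately.
    fmt.Println(a == b)
}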

But clearly there must be a point to writing complex lock-free algorithms, otherwise why would anyone do it? Inserting into a Ctrie takes longer than into a synchronized map or a skip list because there is more indirection. The real advantage of Ctrie is memory consumption: unlike most hash tables, it is always a function of the number of keys currently in the tree. The other performance advantage is that it can take constant-time, linearizable snapshots. We compared snapshotting a synchronized map and a Ctrie under 100 concurrent readers and writers:

BenchmarkConcurrentSnapshotMap-8 1000 9941784 ns/op
BenchmarkConcurrentSnapshotCtrie-8 20000 90412 ns/op

Under certain access patterns, lock-free data structures can provide better performance in multithreaded systems. For example, the NATS message queue uses a synchronized-map-based structure for subscription matching; switching to a lock-free Ctrie increases throughput considerably. In the timing graph below, the blue line is the implementation using the lock-based data structure and the red line is the lock-free implementation.

[Figure: subscription-matching latency — blue: lock-based implementation, red: lock-free Ctrie]
Avoiding locks in certain scenarios can provide a significant performance boost, as the comparison between the ring buffer and the channel showed. However, there is a trade-off between coding complexity and the benefit gained; in fact, sometimes a lock-free structure provides no real benefit at all.

Considerations for optimization

As we saw in the discussion above, performance optimization always has a cost. Knowing and understanding the optimization techniques is only the first step; more important is understanding when and where to apply them. To quote a famous line from C. A. R. Hoare that has become a classic maxim for programmers:

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

But the point of this statement is not to argue against optimization; it’s to learn to trade off different kinds of speed — the speed of the algorithm, the speed of response, the speed of maintenance, and the speed of the system. This is a highly subjective topic with no simple standard. Is premature optimization the root of error? Should I implement the feature first and optimize later? Or does it not need optimizing at all? There is no universally right answer. Sometimes it’s fine to implement first and speed things up later.

However, my advice is to optimize only the critical path. The further you stray from the critical path, the lower the return on optimization, to the point where it’s almost a waste of time. It’s important to be able to judge whether performance is already good enough; don’t waste time beyond that point. Be data-driven — speak from measurements, not hunches. And be pragmatic: there is no point shaving a few nanoseconds off code whose latency doesn’t matter. There are usually bigger gains to be had elsewhere.

Conclusion

If you’ve read this far, congratulations — though you may still have some questions. We’ve learned that in software there are really two kinds of speed: speed of response and speed of execution. Users want the first, developers want the second, and CTOs want both. The first matters most if you want people to use your product at all; the second takes planning and iteration. The two are often in conflict.

Perhaps more practically, we discussed ways to improve Go’s performance and make it more usable for low-latency systems. The Go language was built for simplicity, but that simplicity sometimes comes at a cost. Just like the trade-off between the two kinds of speed above, there is a trade-off between code maintainability and code performance. Speed often means sacrificing code simplicity, more development time, and higher maintenance costs later. Choose wisely.
