The preliminary round of the Tianchi middleware competition officially ended this morning. This blog has been quiet for a month, mainly because almost all of my free time went into the competition. With the first season over, it is time to sum things up; overall, the rewards were considerable.

Let’s start with the results. My final ranking was 15th (excluding the two wildcard teams in the front rows, plus one team disqualified for cheating, it is effectively 12th), which is a fairly satisfactory result, to be honest. This article is intended for readers who used Netty in the competition but did not reach their ideal QPS, who are just starting out with Netty, or who are simply interested in Dubbo Mesh.

Before the competition, my understanding of Netty was superficial at best. I had touched on Netty transport in my earlier series on RPC principles, so I entered the competition essentially from scratch. I will therefore describe my optimization process from that beginner’s perspective, improving QPS step by step without glossing over the pitfalls and negative optimizations along the way. On the other hand, because my understanding of Netty is not deep, there may well be mistakes in this article; please bear with me, and corrections are welcome.

What is a Dubbo Mesh?

For readers unfamiliar with this competition, let me briefly introduce the concept of Dubbo Mesh.

If you’ve ever used Dubbo and know a thing or two about Service Mesh, you can easily understand what Dubbo Mesh is all about. To be clear, Dubbo was originally designed for the Java language, with no cross-language considerations. For NodeJS, Python, and Go to consume Dubbo services seamlessly, there are two options: implement a Dubbo client in each language (node-dubbo-client, python-dubbo-client, go-dubbo-client), or adopt a Service Mesh approach and let Dubbo provide its own cross-language solution that hides the per-language processing details. The latter, the Dubbo ecosystem’s cross-language Service Mesh solution, is what is named Dubbo Mesh. A picture is worth a thousand words:

In the original Dubbo ecosystem there are only three roles: consumer, provider, and registry. In the Dubbo Mesh ecosystem, an agent is started for each service instance (every consumer and provider). Services no longer communicate directly; they interact through their agents, and the agents also handle service registration and discovery. The red agent in the figure is the core of this competition: players choose a suitable language to implement the agent and compete on the QPS their agents achieve, which determines the final ranking.

Problem analysis

This competition centers on implementing a high-concurrency network communication model and covers the following key points: the reactor model, load balancing, threads, locks, I/O communication, blocking vs. non-blocking, zero copy, serialization, HTTP/TCP/UDP and custom protocols, batch processing, garbage collection, service registration and discovery, and so on. Each has a larger or smaller impact on the final program’s QPS; the better you understand them, the better you can write a high-performance Dubbo Mesh.

On the choice of language: my feeling after the preliminary round is that the realistic options were Java, C++, and Go. Many factors went into the choice: versatility, footprint, performance per line of code, players’ habits, and so on. While the top few entries appear to be C++, overall there is no reason language features alone should keep you out of the top 10. Behind the C++ players’ high performance was perhaps the sacrifice of 600+ lines of code to maintain their own etcd library (the competition mandated etcd, and according to the C++ players, C++ lacked a ready-made etcd lib). And since the competition provided a warm-up phase, the Java camp could also smile. The main Java choices were raw NIO, Akka, and Netty, with Netty probably the most popular among the Java players; I chose Netty for my Dubbo Mesh implementation. Go’s goroutines and network libraries are also two powerful tools, no weaker than Java’s, and the lightweight nature of goroutines makes Go a viable option too.

The official Dubbo Mesh demo deliberately leaves a lot of QPS on the table, which makes it a sweet starting point. Let’s review the simplest implementation of Dubbo Mesh:

The diagram above shows the architecture of the initial Dubbo Mesh. The consumer and provider are shown in gray because players cannot modify their implementations; the green agents are where players are free to play. In the competition, the consumer and consumer-agent each run as a single instance, while the provider and provider-agent each start three instances of different capacity: small, medium, and large. This is not shown in the figure, so picture it yourself. Every contestant therefore needs to do the following:

  1. The consumer-agent needs to start an HTTP server to receive HTTP requests from the consumer
  2. The consumer-agent needs to forward the HTTP request to the provider-agent, and because there are multiple provider-agent instances, load balancing is required. How the consumer-agent communicates with the provider-agent is up to you.
  3. After the provider agent receives the request from the consumer agent, it needs to assemble the Dubbo protocol and use TCP to communicate with the provider.

The result is a simple cross-language Dubbo Mesh, where an HTTP request from the consumer ends up invoking a Dubbo service written in Java. How to optimize and what tricks to employ is what made the competition so interesting. None of the optimizations happened overnight; they were submitted day by day, so a timeline is a natural way to describe the transformation.

The optimization process

QPS 1000 to 2500 (asynchronous HTTP between CA and PA)

The official demo runs the whole communication flow out of the box and saved us a lot of time. The initial version evaluates at 1000+ QPS, so 1000 can serve as the baseline for reference. In the demo, the consumer uses asyncHttpClient to send asynchronous HTTP requests, and the consumer-agent uses the Servlet 3.0 async features supported by Spring MVC; but the communication between the consumer-agent and provider-agent uses synchronous HTTP, so the C→CA leg performs much better than the CA→PA leg. The fix is simple: following the C→CA design, replace CA→PA with asynchronous HTTP as well, and QPS jumps straight to 2500.

The main gains come from the asynchronous client provided by async-http-client and the non-blocking API introduced in Servlet 3.0.

<dependency>
    <groupId>org.asynchttpclient</groupId>
    <artifactId>async-http-client</artifactId>
    <version>2.4.7</version>
</dependency>
// Send the HTTP request without blocking
ListenableFuture<org.asynchttpclient.Response> responseFuture = asyncHttpClient.executeRequest(request);

// Return a non-blocking HTTP response
@RequestMapping(value = "/invoke")
public DeferredResult<ResponseEntity> invoke() {
    // ...
}

QPS 2500 to 2800 (load balancing optimized to weighted round robin)

The load balancing algorithm provided in the demo is random: one of small-PA, medium-PA, and large-PA is selected at random. Since each instance has different capacity and response time, random load balancing is seriously unstable and cannot allocate requests according to capacity, so it became the natural second target for transformation.

I switched to a weighted round-robin algorithm, referring to the implementation in Motan (Weibo’s open-source RPC framework). See com.alibaba.dubbo.performance.demo.agent.cluster.loadbalance.WeightRoundRobinLoadBalance (git address at the end of the article).

Weights are configured in the startup script. When a PA starts and registers its service address with etcd, the weight is registered along with it; when the CA pulls the service list, it thus obtains the load ratio.

large:
-Dlb.weight=3
medium:
-Dlb.weight=2
small:
-Dlb.weight=1

At the warm-up round’s maximum concurrency of 256 connections, this 1:2:3 ratio squeezes the most performance out of each PA.
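The weighted selection can be sketched as a simple weighted round-robin counter. This is an illustrative simplification, not Motan’s actual code; the class and method names here are made up:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative weighted round-robin: expands weights into a selection sequence.
// A weight list of {1, 2, 3} yields small:medium:large traffic of 1:2:3.
public class WeightedRoundRobin {
    private final int[] slots;                       // one slot per unit of weight
    private final AtomicInteger index = new AtomicInteger();

    public WeightedRoundRobin(List<Integer> weights) {
        int total = weights.stream().mapToInt(Integer::intValue).sum();
        slots = new int[total];
        int pos = 0;
        for (int server = 0; server < weights.size(); server++) {
            for (int w = 0; w < weights.get(server); w++) {
                slots[pos++] = server;               // server appears `weight` times
            }
        }
    }

    // Returns the index of the server chosen for this request.
    public int select() {
        return slots[Math.floorMod(index.getAndIncrement(), slots.length)];
    }
}
```

With weights 1:2:3, six consecutive calls distribute requests 1/6, 2/6, 3/6 across the three PAs. A production implementation would interleave the slots to smooth the sequence; the pre-expanded array above just keeps the idea obvious.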

QPS 2800 to 3500 (future → callback)

Although C→CA and CA→PA use HTTP, they are non-blocking (requests do not block the I/O thread). In the Dubbo Mesh demo, however, the PA→P leg still uses future.get + CountDownLatch; once any link in the chain blocks, the whole chain blocks and QPS suffers. There has also been plenty of discussion about the different ways of obtaining a result:

The future approach does not block the calling thread during the invocation, but it does block when fetching the result. Since the provider sleeps a fixed 50 ms, obtaining the future result is still time-consuming; moreover, this model usually waits on a lock, which causes significant performance degradation. Switching to callbacks keeps the I/O threads focused on I/O events and reduces the number of threads, which fits Netty’s I/O model well.

Promise<Integer> agentResponsePromise = new DefaultPromise<>(ctx.executor());
agentResponsePromise.addListener();

Netty provides the Promise abstraction and a default implementation, DefaultPromise, for exactly this purpose, so the callback feature works out of the box. In the inbound handler’s channelRead event, create the promise, extract the requestId, and map the requestId to the promise. In the outbound handler’s channelRead event, read the returned requestId, look up the promise, and call its done method to complete the non-blocking request-response. See the inbound handler ConsumerAgentHttpServerHandler and the outbound handler ConsumerAgentClientHandler.
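The requestId-to-promise bookkeeping can be illustrated with the JDK’s CompletableFuture standing in for Netty’s Promise (the shape is the same; in the real handlers the promise is a Netty DefaultPromise bound to the channel’s executor, and the class below is a made-up name):

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative request/response correlation: the inbound side registers a
// promise under the requestId; the outbound side completes it on response.
public class RequestCorrelator {
    private final Map<Long, CompletableFuture<byte[]>> pending = new ConcurrentHashMap<>();

    // Called when a request is sent: register a promise keyed by requestId.
    public CompletableFuture<byte[]> register(long requestId) {
        CompletableFuture<byte[]> promise = new CompletableFuture<>();
        pending.put(requestId, promise);
        return promise;
    }

    // Called when a response arrives: look up and complete the promise.
    public void complete(long requestId, byte[] payload) {
        CompletableFuture<byte[]> promise = pending.remove(requestId);
        if (promise != null) {
            promise.complete(payload);   // fires callbacks; no thread blocks
        }
    }
}
```

A callback attached via `promise.thenAccept(...)` plays the role of Netty’s `addListener`; no thread ever parks on `get()`.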

QPS 3500 to 4200 (HTTP traffic replaced with TCP)

The communication between CA and PA was originally asynchronous HTTP; it can be changed to asynchronous TCP, mirroring the PA→P leg. Customizing the communication protocol between the agents is also very easy. Given TCP’s sticky-packet problem, the common practice is a fixed-length-header + byte-array custom protocol. I first used Protocol Buffers as the wire format, and Netty is friendly enough to provide codecs for it; you just write the DubboMeshProto.proto file:

message AgentRequest {
    int64 requestId = 1;
    string interfaceName = 2;
    string method = 3;
    string parameterTypesString = 4;
    string parameter = 5;
}

message AgentResponse {
    int64 requestId = 1;
    bytes hash = 2;
}

The advantage of Protocol Buffers is its compact encoding, which reduces I/O. But profiling showed DubboMeshProto’s getSerializedSize, getDescriptorForType, and similar methods taking unnecessary time; for a data structure this simple, protobuf is not worth it in this contest. In the end I adopted the fixed-length-header + byte-array custom protocol. See com.alibaba.dubbo.performance.demo.agent.protocol.simple.SimpleDecoder.
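The fixed-length-header framing can be sketched with the JDK’s ByteBuffer (the real SimpleDecoder works on Netty’s ByteBuf, but the framing logic is the same; this class is illustrative): a 4-byte length prefix, then that many payload bytes.

```java
import java.nio.ByteBuffer;

// Illustrative length-prefixed framing: 4-byte big-endian length + payload.
public class SimpleFraming {

    public static ByteBuffer encode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length);   // fixed-length header
        buf.put(payload);             // body
        buf.flip();
        return buf;
    }

    // Returns one decoded frame, or null if the bytes received so far
    // do not yet contain a complete frame (the sticky/half-packet case).
    public static byte[] decode(ByteBuffer in) {
        if (in.remaining() < 4) {
            return null;                       // header not complete yet
        }
        in.mark();
        int length = in.getInt();
        if (in.remaining() < length) {
            in.reset();                        // wait for more bytes
            return null;
        }
        byte[] payload = new byte[length];
        in.get(payload);
        return payload;
    }
}
```

In Netty, this same framing is what `LengthFieldBasedFrameDecoder(Integer.MAX_VALUE, 0, 4, 0, 4)` gives you in the pipeline.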

Along with the move away from HTTP between agents, the CA’s Spring MVC server can also be replaced with a Netty implementation, which makes the CA reactive end to end. Implementing an HTTP server with Netty is simple using the default codecs Netty provides:

public class ConsumerAgentHttpServerInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    public void initChannel(SocketChannel ch) {
        ChannelPipeline p = ch.pipeline();
        p.addLast("encoder", new HttpResponseEncoder());
        p.addLast("decoder", new HttpRequestDecoder());
        p.addLast("aggregator", new HttpObjectAggregator(10 * 1024 * 1024));
        p.addLast(new ConsumerAgentHttpServerHandler());
    }
}

The HTTP server implementation also hit a pit: when decoding the HTTP request I failed to release a ByteBuf, which dropped QPS by 2000+, worse even than the Spring MVC implementation. With the help of my teammate @Flash, the memory leak was located.

public static Map<String, String> parse(FullHttpRequest req) {
    Map<String, String> params = new HashMap<>();
    // This is a POST request
    HttpPostRequestDecoder decoder = new HttpPostRequestDecoder(new DefaultHttpDataFactory(false), req);
    List<InterfaceHttpData> postList = decoder.getBodyHttpDatas();
    for (InterfaceHttpData data : postList) {
        if (data.getHttpDataType() == InterfaceHttpData.HttpDataType.Attribute) {
            MemoryAttribute attribute = (MemoryAttribute) data;
            params.put(attribute.getName(), attribute.getValue());
        }
    }
    // resolve the memory leak
    decoder.destroy();
    return params;
}

After the official round I found an even faster way to decode: instead of HttpPostRequestDecoder, use QueryStringDecoder:

public static Map<String, String> fastParse(FullHttpRequest httpRequest) {
    String content = httpRequest.content().toString(StandardCharsets.UTF_8);
    QueryStringDecoder qs = new QueryStringDecoder(content, StandardCharsets.UTF_8, false);
    Map<String, List<String>> parameters = qs.parameters();
    String interfaceName = parameters.get("interface").get(0);
    String method = parameters.get("method").get(0);
    String parameterTypesString = parameters.get("parameterTypesString").get(0);
    String parameter = parameters.get("parameter").get(0);
    Map<String, String> params = new HashMap<>();
    params.put("interface", interfaceName);
    params.put("method", method);
    params.put("parameterTypesString", parameterTypesString);
    params.put("parameter", parameter);
    return params;
}

To save space, only the post-optimization versions are shown here; the later optimizations are not repeated.

QPS 4200 to 4400 (reusing Netty EventLoops)

This optimization came from a friend, @Half-a-glass-of-water, who had also never used Netty before. During the match he brushed up on Netty’s threading model and learned that Netty can bind a client’s outbound channel to an existing EventLoop, reusing it. If the inbound and outbound I/O run on the same thread, unnecessary context switches are avoided. The effect is not obvious at 256 concurrency, only about 200 QPS, but it is pronounced at 512. Reusing EventLoops has its own section in Netty in Action; though short, it clearly explains how (note that the reuse happens in both CA and PA).

// eventLoopGroup of the inbound server
private EventLoopGroup workerGroup;

// Pre-create channels for the outbound client, one per EventLoop
private void initThreadBoundClient(EventLoopGroup workerGroup) {
    for (EventExecutor eventExecutor : workerGroup) {
        if (eventExecutor instanceof EventLoop) {
            ConsumerAgentClient consumerAgentClient = new ConsumerAgentClient((EventLoop) eventExecutor);
            consumerAgentClient.init();
            ConsumerAgentClient.put(eventExecutor, consumerAgentClient);
        }
    }
}

Use the inbound server’s eventLoopGroup to pre-create channels for the outbound client so that each EventLoop is reused. A companion optimization: since the inbound and outbound handlers now run on the same thread, the Map&lt;requestId, Promise&gt; can move from a ConcurrentHashMap into a ThreadLocal, eliminating a ConcurrentHashMap and further reducing lock contention.
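Since registration and completion now happen on the same EventLoop thread, the pending-request map needs no synchronization at all. A minimal sketch (illustrative; names made up, CompletableFuture standing in for Netty’s Promise):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Illustrative thread-confined pending map: each I/O thread owns its own
// plain HashMap, so no ConcurrentHashMap and no lock contention.
public class ThreadBoundPendingMap {
    private static final ThreadLocal<Map<Long, CompletableFuture<byte[]>>> PENDING =
            ThreadLocal.withInitial(HashMap::new);

    public static void put(long requestId, CompletableFuture<byte[]> promise) {
        PENDING.get().put(requestId, promise);   // safe: only this thread touches it
    }

    public static CompletableFuture<byte[]> remove(long requestId) {
        return PENDING.get().remove(requestId);
    }
}
```

This only works because the channel’s inbound and outbound events are pinned to the same EventLoop; Netty’s FastThreadLocal would shave a little more off than the JDK ThreadLocal.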

At this stage the overall architecture was clear: C→CA, CA→PA, and PA→P all implement an asynchronous, non-blocking reactor model, and QPS reached 4400 at 256 concurrency.

The official round’s 512 connections bring a new pattern

The code above performed well at 256 concurrency in the warm-up, but in the official round the maximum concurrency was doubled to widen the gaps between teams, and my QPS did not scale well, stuck at 5400. After comparing notes with friends who had also reached 4400 at 256 connections, we found the differences lay mainly in the number of I/O threads between CA and PA and the number of connections between PA and P. 5400 was clearly below my expectations, so I modified the original provider-agent design to reduce the connection count. The optimizations from here on target the official round’s 512 connections, not the warm-up’s 256.

QPS 5400 to 5800 (reducing the number of connections)

After reading many articles on channel tuning in Netty, I still was not sure whether the connection count was the key factor holding my code back, and conversations with friends did not reveal why QPS was stuck at 5400, so I reworked the provider-agent design with a let’s-try-it mindset. Using the same design as the consumer-agent, it grabs the provider-agent inbound server’s worker thread group and pre-creates the outbound request channels, reducing the original 4 threads and 4 channels to 1 thread and 1 channel. With no other changes, QPS reached 5800.

In theory the number of channels should not be a performance bottleneck; this may have to do with the provider-side Dubbo thread pool policy. Using as few connections and threads as the server’s I/O event-handling capacity reasonably allows improves QPS by cutting unnecessary thread switches. In this setup the CA has 4 I/O threads; the inbound connections are HTTP, up to 512 of them; the outbound connections are bound to threads and subject to load balancing.

Another problem appeared at this stage. The provider’s thread pool is fixed at 200 threads; if large-PA kept receiving 3/(1+2+3) = 50% of the requests, the provider thread pool could easily fill up, so the weights were adjusted to 1:2:2. It is not only machine performance that constrains weighted load balancing, but also the provider’s connection-handling capacity.

QPS 5800 to 6100 (replacing NIO with epoll)

Thanks again to @Half-a-glass-of-water for the reminder. Since the evaluation environment is Linux, we can use Netty’s own EpollSocketChannel instead of NioSocketChannel. The improvement exceeded my imagination and pushed me straight past the 6000 mark.

private EventLoopGroup bossGroup = Epoll.isAvailable() ? new EpollEventLoopGroup(1) : new NioEventLoopGroup(1);
private EventLoopGroup workerGroup = Epoll.isAvailable() ? new EpollEventLoopGroup(2) : new NioEventLoopGroup(2);

bootstrap = new ServerBootstrap();
bootstrap.group(bossGroup, workerGroup)
        .channel(Epoll.isAvailable() ? EpollServerSocketChannel.class : NioServerSocketChannel.class);

Since I develop on a Mac, where epoll is unavailable, I added the availability check above.

NioServerSocketChannel uses the JDK’s NIO, which picks the I/O model by operating system; on Linux it is also epoll, but level-triggered by default. Netty’s EpollSocketChannel is edge-triggered by default. I originally assumed the ET-vs-LT difference explained the large QPS gap, but while tuning epoll parameters later I found that EpollSocketChannel configured as level-triggered loses no QPS. Under the particular conditions of the contest, my guess is that the gap comes not from the trigger mode but from Netty’s native epoll transport being an optimization in itself.

// Edge-triggered is the default
bootstrap.option(EpollChannelOption.EPOLL_MODE, EpollMode.EDGE_TRIGGERED);
// The trigger mode can be changed to level-triggered
bootstrap.option(EpollChannelOption.EPOLL_MODE, EpollMode.LEVEL_TRIGGERED);

QPS 6100 to 6300 (agent custom protocol optimization)

I mentioned the custom protocol between agents earlier. Because I initially used protobuf, the performance problems surfaced here: they were especially obvious at 512 concurrency. I eventually replaced it with the custom Simple protocol, both for safety and for compatibility with one of my later optimizations. Since it was covered earlier, I won’t go into it again.

QPS 6300 to 6500 (parameter tuning and zero-copy)

This round of optimization came from exchanges with @Sleeve-Xu Huajian, many thanks. More Netty knobs I had not known about:

  1. Disable Netty memory leak detection:
-Dio.netty.leakDetectionLevel=disabled

Netty samples about 1% of ByteBuf allocations at runtime to detect memory leaks; disabling this improves performance.

  2. Enable TCP_QUICKACK:
bootstrap.option(EpollChannelOption.TCP_QUICKACK, java.lang.Boolean.TRUE)

Unlike UDP, TCP relies on ACKs for reliable transmission. Netty exposes this epoll option so that ACKs are sent promptly rather than delayed.

  3. Enable TCP_NODELAY:
serverBootstrap.childOption(ChannelOption.TCP_NODELAY, true)

Most people probably know this one already; it is listed for completeness. An RPC optimization article by Alibaba’s Bixuan, found online, mentioned that ChannelOption.TCP_NODELAY=false might perform better under high concurrency, but my tests did not confirm it.

Other tuning parameters verge on the metaphysical and had little impact on the final QPS. Parameter tuning takes little skill, yet it can significantly affect results.

One more optimization happened in this stage alongside parameter tuning, so I cannot say which had the larger impact. The Dubbo protocol encoding in the demo does not achieve zero-copy, invisibly adding an extra copy of the data; the same issue existed in my custom protocol. In Dubbo Mesh practice, use ByteBuf wherever possible instead of other objects; slice() and CompositeByteBuf make zero-copy convenient.
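The zero-copy idea can be illustrated with the JDK’s ByteBuffer, whose `slice()` behaves like Netty’s `ByteBuf.slice()`: the view shares the backing storage rather than copying bytes out (illustrative sketch; the frame layout assumed here is a 4-byte length header):

```java
import java.nio.ByteBuffer;

// Illustrative zero-copy: slice() shares the backing storage instead of
// copying bytes. Netty's ByteBuf.slice()/CompositeByteBuf behave the same
// way at the ByteBuf level.
public class ZeroCopySlice {

    // Extract the payload view of a framed message without copying it.
    public static ByteBuffer payloadView(ByteBuffer frame) {
        frame.position(4);            // skip the 4-byte length header
        return frame.slice();         // shares memory with `frame`
    }
}
```

A CompositeByteBuf applies the same principle in the other direction: it stitches a header buffer and a body buffer into one logical buffer without merging their bytes.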

QPS 6500 to 6600 (custom HTTP codec)

Watching others on the leaderboard climb while I stayed at 6500, I turned to the “wrong” idea: to hell with generality, parse the HTTP protocol myself instead of using Netty’s HTTP codec. No more QueryStringDecoder (itself already faster than HttpPostRequestDecoder); for a fixed-format HTTP request, custom parsing is very easy to implement.

POST / HTTP/1.1\r\n
Content-Length: 560\r\n
Content-Type: application/x-www-form-urlencoded\r\n
Host: 127.0.0.1:20000\r\n
\r\n
interface=com.alibaba.dubbo.performance.demo.provider.IHelloService&method=hash&parameterTypesString=Ljava%32lang%32String;&parameter=xxxxx

The HTTP text protocol itself is a bit complicated in the general case, which is why Netty’s general-purpose implementation is not necessarily as fast as hand-rolled parsing for one fixed request shape. I won’t go into the sticky-packet handling details, which are rather hacky.
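A simplified sketch of what such hand-rolled parsing looks like for the one fixed request shape (this ignores sticky packets, assumes the whole request arrived in one read, and skips URL-decoding; the real handler works on ByteBuf, and the class name is made up):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative hardcoded parser for the single known request shape:
// skip the headers, then split the urlencoded body on '&' and '='.
public class FixedHttpRequestParser {

    public static Map<String, String> parseBody(String rawRequest) {
        // The blank line (\r\n\r\n) separates headers from the body.
        int bodyStart = rawRequest.indexOf("\r\n\r\n") + 4;
        String body = rawRequest.substring(bodyStart);
        Map<String, String> params = new HashMap<>();
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            params.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return params;
    }
}
```

Because the four parameter names and their order are fixed in this contest, even the split can be replaced by scanning known offsets; the version above keeps it readable.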

The response is:

HTTP/1.1 200 OK\r\n
Connection: keep-alive\r\n
Content-Type: text/plain; charset=UTF-8\r\n
Content-Length: 6\r\n
\r\n
123456

QPS 6600 to 6700 (eliminating intermediate objects)

Continuing down the crazy path, with generality abandoned: all the earlier intermediate objects were removed, and encode/decode were squeezed into the handlers wherever possible. The code is uncomfortable to look at and full of hardcoding, but the effect is real: YGC counts dropped a lot, and the data path uses ByteBuf and byte[] throughout. This optimization also leans toward a hack, so I won’t dwell on it.

QPS 6700 to 6850 (batch flush, batch decode)

To be honest, beyond 6700 luck starts to matter. From the banter in the chat group, network I/O at 512 concurrency was very jittery; whether that was a machine problem or inherent to high concurrency is unclear. So the 6700-to-6850 stretch was tortuous and unstable: out of 20 submissions, only two exceeded 6800.

The optimization here comes from @Flash’s batch-flush class: increase the bytes transmitted per write and reduce the number of network I/Os. The simple idea is, in Netty, to write 10 times and flush once. I implemented two versions of batch flush: one accumulates a count of writes on the same channel and flushes at a threshold; the other forces a flush at the end of an EventLoop iteration. After a lot of testing, the environment was too jittery to detect much difference between them.

handler(new ChannelInitializer<SocketChannel>() {
    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline()
            .addLast(new SimpleDecoder())
            .addLast(new BatchFlushHandler(false))
            .addLast(new ConsumerAgentClientHandler());
    }
});

The idea of batch decode comes from AbstractBatchDecoder, an abstract class provided in SOFABolt, Ant Financial’s RPC framework.

Netty’s convenient decoding tool class is ByteToMessageDecoder. As shown in the upper half of the figure, it accumulates: it reads as many bytes as possible from the socket, calls decode synchronously to produce business objects collected into a List, and finally loops over the List, submitting elements to the ChannelPipeline one by one. The small change, shown in the lower half, is to submit the entire List in a single pass, which reduces the number of pipeline traversals and improves throughput. This pattern has no advantage in low-concurrency scenarios, but yields a significant throughput boost under high concurrency.
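The difference between the two dispatch styles can be sketched as follows (illustrative; AbstractBatchDecoder does this inside the pipeline, and these names are made up):

```java
import java.util.List;
import java.util.function.Consumer;

// Illustrative batch dispatch: ByteToMessageDecoder fires the pipeline once
// per decoded object; the batch variant fires once per decoded *list*.
public class BatchDispatch {

    // One pipeline traversal per message (ByteToMessageDecoder style).
    public static <T> int dispatchOneByOne(List<T> decoded, Consumer<Object> pipeline) {
        for (T msg : decoded) {
            pipeline.accept(msg);
        }
        return decoded.size();          // number of pipeline traversals
    }

    // A single traversal carrying the whole list (AbstractBatchDecoder style);
    // downstream handlers then process the list in one go.
    public static <T> int dispatchBatched(List<T> decoded, Consumer<Object> pipeline) {
        if (decoded.isEmpty()) {
            return 0;
        }
        pipeline.accept(decoded);
        return 1;
    }
}
```

Under high concurrency, each socket read tends to accumulate many frames, so the saved traversals add up.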

It is worth pointing out that in this particular scenario the benefit of reusing EventLoops in Dubbo Mesh becomes questionable, yet my best result did come after adopting AbstractBatchDecoder; when I ran ByteToMessageDecoder and AbstractBatchDecoder head to head, the latter indeed achieved higher QPS.

Conclusion

Honestly, at QPS 6500 the overall code was still fairly presentable, at least something I could show to others. In the end, though, chasing performance on a short clock meant hardcoding in many places, while code fit for production demands generality and extensibility. After the contest I will maintain two branches: highest-qps, which pursues performance, and another branch that keeps things general. Starting as a Netty novice, I learned a great many things from this competition, which was a real harvest. Finally, thanks to everyone who gave me guidance along the way.

Highest-QPS branch: highest-qps

General branch (suitable for getting started with Netty): master

Code.aliyun.com/250577914/a…

Finally, a plug for my teammate @Flash’s Netty video tutorial; two of the harder optimizations in the contest were his work. Search for Netty on imooc.com to find his Netty source-code analysis videos.

Welcome to follow my WeChat official account, “Kirito technology sharing”; I will answer any questions about the article and share more Java-related technology.