Push service

I still remember a project from about a year and a half ago that needed an Android push service. Unlike iOS, the Android ecosystem has no unified push service. Google does have Google Cloud Messaging, but it is not universally adopted even overseas, let alone in China, where it is simply blocked.

So in the past, most Android push implementations had to rely on polling. During our technical research we found the JPush blog, which described some of their technical characteristics. What they mainly provide is a long connection service over mobile networks. 500K to 1M (50W-100W) connections on a single machine genuinely scared me! We ended up adopting their free plan, and since our product had a small audience, the free tier was enough. It ran stably for over a year, which was great.

Two years later, after changing departments, I was given the task of optimizing the company’s own long connection server.

After searching for technical material online again, I found that many of the hard problems in this area had already been cracked, and there were plenty of write-ups about them. 500K to 1M connections on a single machine is not a dream at all; in fact, anyone can do it. But connections alone are not enough: QPS has to go up with them.

So, this article is a summary of the difficulties and optimizations in implementing long connection services using Netty.

What is Netty

Netty: http://netty.io/

Netty is an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients.

The official description is the most accurate, and performance is its most attractive quality. But many people have this doubt: surely writing directly against NIO must be faster? Just like raw JDBC takes more code than iBATIS, but it surely must be faster!

However, once you get to know Netty, you will find that this is not necessarily true!

The advantages of using Netty instead of NIO to write directly are:

  • A high-performance, highly scalable architecture; most of the time you only need to focus on the business, not the plumbing
  • Zero-copy techniques that minimize memory copying
  • A native socket transport implementation for Linux
  • The same code runs on Java 1.7's NIO2 and on NIO in earlier versions
  • Pooled buffers that greatly reduce the pressure of allocating and releasing buffers (see the sketch after this list)
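
As a rough illustration of the last two points, here is a minimal sketch of how a server bootstrap might opt into the native epoll transport and the pooled allocator. It assumes Netty 4.x with netty-transport-native-epoll on the classpath; the class and option names below come from that version, and the handler part is left empty just like the examples later in this article.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.ChannelPipeline;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollServerSocketChannel;
import io.netty.channel.socket.SocketChannel;

public class NativeTransportSketch {
    public static void main(String[] args) throws Exception {
        // Epoll* classes use Netty's native Linux transport instead of the JDK NIO selector.
        EpollEventLoopGroup bossGroup = new EpollEventLoopGroup();
        EpollEventLoopGroup workerGroup = new EpollEventLoopGroup();

        ServerBootstrap bootstrap = new ServerBootstrap()
                .group(bossGroup, workerGroup)
                .channel(EpollServerSocketChannel.class)
                // Pooled buffers: reuse buffer memory instead of allocating on every read and write.
                .childOption(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ChannelPipeline pipeline = ch.pipeline();
                        //todo: add handler
                    }
                });
        bootstrap.bind(8080).sync();
    }
}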

There are so many features, you can check out Netty in Action for more.

In addition, the Netty source code itself is an excellent textbook! It is well worth reading while you use the framework.

What’s the bottleneck

What is the ultimate goal of a long connection service? And where are the bottlenecks?

In fact, there are two main goals:

  • More Connections
  • Higher QPS

So, let’s talk about the difficulties and attention points for these two goals.

More Connections

Reaching a million connections is actually not difficult with either Java NIO or Netty. Since both are non-blocking IO, there is no need to create a thread per connection.

For more background, search for BIO, NIO, and AIO.

Java NIO implements millions of connections

ServerSocketChannel ssc = ServerSocketChannel.open();
Selector sel = Selector.open();

ssc.configureBlocking(false);
ssc.socket().bind(new InetSocketAddress(8080));
SelectionKey key = ssc.register(sel, SelectionKey.OP_ACCEPT);

while (true) {
    sel.select();
    Iterator<SelectionKey> it = sel.selectedKeys().iterator();
    while (it.hasNext()) {
        SelectionKey skey = it.next();
        it.remove();
        if (skey.isAcceptable()) {
            // accept the connection and do nothing else with it
            SocketChannel ch = ssc.accept();
        }
    }
}

This code only accepts incoming connections, does nothing, and is only used to test the standby limit.

And as you can see, this is plain NIO code; there is nothing special about it.

Netty implements millions of connections

NioEventLoopGroup bossGroup = new NioEventLoopGroup();
NioEventLoopGroup workerGroup = new NioEventLoopGroup();
ServerBootstrap bootstrap = new ServerBootstrap();
bootstrap.group(bossGroup, workerGroup);

bootstrap.channel(NioServerSocketChannel.class);

bootstrap.childHandler(new ChannelInitializer<SocketChannel>() {
    @Override
    protected void initChannel(SocketChannel ch) throws Exception {
        ChannelPipeline pipeline = ch.pipeline();
        //todo: add handler
    }
});
bootstrap.bind(8080).sync();

This is also just plain Netty initialization code. Again, there is nothing special in it for reaching a million connections.

Both implementations are very simple and present no difficulty, so you have to ask: what is the real bottleneck to reaching a million connections?

In fact, as long as you use non-blocking IO in Java (both NIO and AIO count), a handful of threads can serve a huge number of socket connections. Since no thread is created per connection, as it would be in BIO, the code itself never becomes the bottleneck.

The real bottleneck is the Linux kernel configuration: the default settings limit the maximum number of open files, both system-wide and per process. So you need to change some of the kernel and ulimit settings.
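
Purely as an illustration of the kind of settings online guides typically raise (the exact parameters and values depend on your distribution and workload, so treat these numbers as placeholders rather than a recommendation):

# /etc/sysctl.conf (apply with sysctl -p) -- illustrative values only
fs.file-max = 1048576
net.ipv4.ip_local_port_range = 1024 65535

# /etc/security/limits.conf -- per-process open file limit
* soft nofile 1048576
* hard nofile 1048576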

This seems trivial now: you just follow one of the configuration guides on the Internet and you are done. But we should not forget how hard it must have been for the first people who worked it out.

Getting a test server in place was not hard; we had one ready quickly. The big question was: how do we verify that the server really supports a million connections?

We wrote a test client with Netty. It also uses non-blocking IO, so it does not need to open many threads. However, a single machine only has so many local ports: even with root privileges, it can open at most about 6W (60,000) outbound connections. So the Netty client simply uses up every connection one machine can make.

NioEventLoopGroup workerGroup = new NioEventLoopGroup();
Bootstrap b = new Bootstrap();
b.group(workerGroup);
b.channel(NioSocketChannel.class);

b.handler(new ChannelInitializer<SocketChannel>() {
    @Override
    public void initChannel(SocketChannel ch) throws Exception {
        ChannelPipeline pipeline = ch.pipeline();
        //todo: add handler
    }
});

for (int k = 0; k < 60000; k++) {
    // please change this to the server's IP
    b.connect("127.0.0.1", 8080);
}

The code is also very simple: it just connects and does not need to do anything else.

Then just find a machine and start the program. One thing to note: it is best to apply the same Linux kernel parameter changes on the client side as on the server side.

How do you find all those machines

With the approach above, a single machine tops out at 6W (60,000) connections, so a million connections need at least 17 machines!

How do we break through this limit? The limitation is tied to the network card, i.e. to each machine's IP address. We solved it by using virtual machines and configuring each VM's virtual NIC in bridged mode, so that every VM gets its own address.

Depending on the size of the physical machine memory, a single physical machine can run at least 4-5 virtual machines, so eventually only 4 physical machines will be enough for a million connections.

The clever way

In addition to squeezing machine resources with virtual machines, there is another neat way to do this, which I stumbled upon during validation.

Under TCP/IP, the normal disconnect process only begins when one side sends a FIN. If the network merely drops out for a moment, the connection is not torn down automatically.

So can we exploit that?

  • When starting the server, do not enable the socket's keep-alive option (it is off by default)
  • Connect to the server from a virtual machine
  • Force the virtual machine to shut down
  • Change the MAC address of the VM's network card, restart it, and connect to the server again
  • The server accepts the new connections while keeping all the previous ones alive

What we want to verify is the server’s limit, so just keep making the server think there are so many connections, right?

In our tests, this method works just as well as using a real machine to connect to the server, because the server just thinks the network is bad and won’t disconnect you.

Also, keep-alive has to stay disabled because, with it enabled, the socket periodically checks whether the connection is still usable and forcibly closes it if it is not.
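
If you want to make that explicit in Netty rather than relying on the default, the server bootstrap can set the option itself. A one-line sketch (the option name is from Netty 4.x):

// Explicitly disable TCP keep-alive on accepted child channels (it is already off by default).
bootstrap.childOption(ChannelOption.SO_KEEPALIVE, false);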

Higher QPS

Since both NIO and Netty are non-blocking IO, no matter how many connections there are, only a small number of threads are required. And the QPS does not decrease as the number of connections increases (provided there is sufficient memory).

And Netty itself is designed well enough not to be a bottleneck for high QPS. What is the bottleneck to high QPS?

It’s the design of the data structure!

To optimize data structures, you first need to be familiar with the characteristics of the various collections. But in a complex project, a single collection is rarely enough; you often end up combining several of them.

It’s really hard to achieve high performance, consistency, and no deadlocks…

The lesson I’ve learned here is don’t optimize too early. Prioritize consistency, make sure the data is accurate, and then look for ways to optimize performance.

Because consistency matters far more than performance, and for many performance problems the bottleneck sits in a completely different place at small scale than at large scale. So I think the best approach is to write the code with consistency first and performance second; once the code works, go back, find the number-one hot spot, and fix it!

Solving CPU bottlenecks

Before doing this kind of optimization, it pays off to hammer your server hard in a test environment first.

With stress testing in place, you need tools to find performance bottlenecks!

I like to use VisualVM: open the tool, go to the Sampler, sort by Self Time (CPU) in descending order, and the entry at the top is the point you need to optimize!

Note: what is the difference between the Sampler and the Profiler? The Sampler works by sampling: its data is not perfectly accurate, but it barely affects performance. The Profiler instruments everything: its data is accurate, but it hurts performance badly. If your program is CPU intensive, stick with the Sampler as much as possible; turning on the Profiler would degrade performance and distort the results.



I remember that the first bottleneck we found in our project turned out to be the size() method of ConcurrentLinkedQueue. When the queue is small it does not matter, but when the queue is large, every call walks the whole queue to count the elements, and we called size() very frequently, so it hurt performance.

The size() implementation is as follows:

public int size() {
    int count = 0;
    for (Node<E> p = first(); p != null; p = succ(p))
        if (p.item != null)
            // Collection.size() spec says to max out
            if (++count == Integer.MAX_VALUE)
                break;
    return count;
}

We later solved the problem by keeping a separate AtomicInteger as the counter. Doesn't splitting the counter from the queue give up strict consistency? That is fine here: this part of our code only needs eventual consistency, so we just have to make sure the count eventually converges to the right value, as in the sketch below.
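
A minimal sketch of that idea, assuming all access goes through one wrapper (the class and method names here are illustrative, not our actual code):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper: keeps an O(1), eventually consistent size next to the queue.
public class CountedQueue<E> {
    private final Queue<E> queue = new ConcurrentLinkedQueue<E>();
    private final AtomicInteger size = new AtomicInteger();

    public void add(E e) {
        queue.add(e);
        size.incrementAndGet();
    }

    public E poll() {
        E e = queue.poll();
        if (e != null) {
            size.decrementAndGet();
        }
        return e;
    }

    // Cheap size check; may briefly lag behind the real queue under concurrency.
    public int size() {
        return size.get();
    }
}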

In short, specific case to specific analysis, different business to use different implementation.

Resolving GC bottlenecks

GC bottlenecks are really part of the CPU bottleneck, because a misbehaving GC can significantly eat into CPU time.

I’m still using VisualVM, but you need to install a plugin: VisualGC



With this plugin, you can see the GC activity visually.

As expected, a large number of New (young) GCs during the stress test is normal, because a huge number of objects are being created and destroyed.

But seeing a lot of Old GC right from the start was a little strange!

It turned out that in our stress-test environment, QPS has nothing to do with the number of connections, so we had only opened a small number of connections, and therefore not much memory was actually being held.

In the JVM, the default ratio of the young generation to the old generation is 1:2, so most of the old generation was wasted while the young generation was too small.

By adjusting -XX:NewRatio, the number of Old GCs dropped significantly.

Production environments, on the other hand, do not see such a high QPS, but they do have a huge number of connections, and the objects tied to those connections live a long time, so production should be given a larger old generation.
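
Purely to illustrate that trade-off (the real values have to come from your own measurements; the flags below are standard HotSpot options, but the numbers are placeholders):

# Stress-test box: few connections, lots of short-lived objects -> relatively larger young generation
java -Xms8g -Xmx8g -XX:NewRatio=1 ...

# Production: many long-lived, per-connection objects -> relatively larger old generation
java -Xms8g -Xmx8g -XX:NewRatio=3 ...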

In short, GC optimization, like CPU optimization, also needs constant tuning, continuous optimization, not overnight.

Other optimizations

Once you have finished your own optimizations, be sure to read Netty Best Practices a.k.a Faster == Better, by the author of Netty in Action.

I believe you will get a lot out of it. After the small optimizations mentioned above, our overall QPS improved considerably.

Finally, Java 1.7 performs much better than Java 1.6! Netty is written in an event-driven style that looks like AIO. Java 1.6 has no AIO, while Java 1.7 does support it, so simply moving to Java 1.7 brings a noticeable performance improvement.

The final results

After several weeks of continuous stress testing and tuning, we reached 600,000 connections and 200,000 QPS with Java 1.6 on a 16-core machine with 120 GB of memory (the JVM was only given 8 GB).

In fact, this is not the limit: the JVM was only given 8 GB of heap, and with a larger heap the number of connections could go even higher.

The QPS looks very high while the System Load Average stays very low, which means the bottleneck is not CPU or memory; it should be in IO. The Linux configuration above was tuned for a million connections and has not yet been optimized for our own business scenario.

Since the current performance is more than enough (the peak QPS in production is only about 1W, i.e. 10,000), we are putting our energy elsewhere for now. I believe we will keep optimizing this part, and I look forward to an even bigger breakthrough in QPS!

Source: dozer.cc/2014/12/netty-long-connection.html