Contents

First, operating system optimization

Second, Netty tuning

1. Set a reasonable number of threads

2. Heartbeat optimization

3. Tune the receive and send buffers

4. Make proper use of memory pools

5. Separate the I/O thread from the business thread

Third, JVM-level performance optimization

1. Determine GC optimization objectives

2. Determine the server memory usage

3. GC optimization process


Can our network applications support millions of connections on a single machine? Yes, but there is a lot of work to do, and you also have to consider whether the system resources of a single machine can sustain millions of concurrent connections.

First, operating system optimization

The first step is to break through the operating system's default limits.

On Linux, the maximum number of concurrent TCP connections is limited by the number of files that a single process can open. (This is because the system creates a socket handle for each TCP connection, and each socket handle is also a file handle.)

You can run the ulimit command to view the maximum number of files that a process of the current user is allowed to open:

$ ulimit -n
1024

This means that each process of the current user is allowed to open at most 1024 files at the same time. From these 1024 files you must subtract standard input, standard output, standard error, the server's listening socket, the Unix domain socket used for inter-process communication, and so on, which leaves roughly 1024 - 10 = 1014 files available for client socket connections. In other words, by default a Linux-based communication program can support at most about 1014 concurrent TCP connections.

A communication program that needs to support more concurrent TCP connections must therefore raise the number of files that Linux allows the current user's processes to open simultaneously.

The easiest way to change the maximum number of open files for a single process is to use the ulimit command:

$ ulimit -n 1000000

If the command prints something like "Operation not permitted", the modification failed because the value you specified exceeds the soft or hard limit that Linux imposes on the user's number of open files. You therefore need to change these soft and hard limits first.

Soft limit: a further restriction, within what the current system can bear, on the number of files a user may open at the same time.

Hard limit: the maximum number of files the system can open simultaneously, calculated from the system's hardware resources (mainly memory).

Step 1: modify the /etc/security/limits.conf file and add the following lines to it:

* soft nofile 1000000
* hard nofile 1000000

The '*' sign indicates that the limit applies to all users; soft or hard specifies whether the soft or the hard limit is being modified; 1000000 is the new limit value, i.e. the maximum number of open files (note that the soft limit must be less than or equal to the hard limit). Save the file after modification.

Step 2: modify the /etc/pam.d/login file and add the following line to it:

session required /lib/security/pam_limits.so

This tells Linux to call the pam_limits.so module after the user logs in to the system to set the maximum limits on the resources the user may use (including the maximum number of files the user can open). The pam_limits.so module reads its configuration from the /etc/security/limits.conf file and applies these limits. Save the file after modification.

Step 3: check the system-level maximum number of open files on Linux by running the following command:

[root@VM_0_15_centos ~]# cat /proc/sys/fs/file-max
98566

This shows that this Linux system can open at most 98,566 files at the same time (the total across all users). This is the system-level hard limit, and the sum of files opened by all users must not exceed it. In general, this system-level hard limit is the optimal maximum computed by Linux at boot time from the system's hardware resources, and it should not be changed without special need, unless you want to set a user-level open-file limit beyond it.

How do you change this system-level maximum file descriptor limit? Modify the /etc/sysctl.conf file:

# add the following line to /etc/sysctl.conf
fs.file-max = 1000000
# then apply the change
sysctl -p

Second, Netty tuning

1. Set a reasonable number of threads

Thread-pool tuning focuses on two pools: the Acceptor thread pool (Netty's boss NioEventLoopGroup), which accepts TCP connections and performs TLS handshakes for massive numbers of devices, and the I/O worker thread pool (Netty's worker NioEventLoopGroup), which handles network reads and writes and heartbeat dispatch.

For a Netty server, a single listening port is usually enough for end-side devices to connect to. However, if the server cluster has few instances, or is even deployed on a single machine (or an active/cold-standby pair), and a large number of end-side devices connect within a short time, the server's listening model and thread model need to be optimized so that millions of end-side devices can be admitted within a short window (for example, 30 seconds).

The server can listen on multiple ports and use the master/slave Reactor thread model to optimize access, with the front end performing Layer-4 and Layer-7 load balancing through SLB.

The master/slave Reactor thread model has the following characteristics: the server no longer uses a single NIO thread to accept client connections, but a dedicated NIO thread pool; after the Acceptor accepts and handles a client TCP connection request (including access authentication), it registers the newly created SocketChannel with an I/O thread of the I/O thread pool (the sub-Reactor thread pool), which then performs all reads, writes, encoding and decoding for that SocketChannel. The Acceptor thread pool is used only for client login, handshake and security authentication; once the link is established, it is registered with an I/O thread of the back-end sub-Reactor thread pool, which handles subsequent I/O operations.
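For illustration, here is a minimal sketch of such a boss/worker (master/slave Reactor) setup using Netty's standard API; the single boss thread, the example port 8080 and the LoggingHandler placeholder are assumptions made for the sketch, not values taken from this article.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.logging.LogLevel;
import io.netty.handler.logging.LoggingHandler;

public class ReactorServer {
    public static void main(String[] args) throws InterruptedException {
        // Acceptor (boss) pool: only accepts connections and performs handshakes.
        EventLoopGroup bossGroup = new NioEventLoopGroup(1);
        // I/O (worker) pool: with no argument it defaults to CPU cores * 2 threads.
        EventLoopGroup workerGroup = new NioEventLoopGroup();
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(bossGroup, workerGroup)
             .channel(NioServerSocketChannel.class)
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     // Real codec and business handlers would be added here.
                     ch.pipeline().addLast(new LoggingHandler(LogLevel.INFO));
                 }
             });
            ChannelFuture f = b.bind(8080).sync(); // example port
            f.channel().closeFuture().sync();
        } finally {
            bossGroup.shutdownGracefully();
            workerGroup.shutdownGracefully();
        }
    }
}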

To tune the I/O worker thread pool, start with the default size (number of CPU cores x 2) and run a performance test. During the test, collect the CPU usage of the I/O threads to check for bottlenecks; specifically, observe the thread stacks. If the stacks repeatedly show the threads sitting in SelectorImpl.lockAndDoSelect, the I/O threads are relatively idle and there is no need to increase the number of worker threads.

If the thread-stack hotspots show that the I/O threads spend their time executing ChannelHandler reads and writes, the number of NioEventLoop threads can be increased appropriately to improve network read/write performance.

2. Heartbeat optimization

For a server with massive numbers of device connections, the heartbeat optimization strategy is as follows.

  1. Detect and remove invalid connections promptly to prevent a backlog of dead connection handles, which can lead to OOM and other problems.
  2. Set a reasonable heartbeat period to prevent a backlog of heartbeat timer tasks, which would cause frequent old-generation GC (both young- and old-generation GC cause STW pauses, but their durations differ greatly) and application pauses.
  3. Use the link idle-detection mechanism provided by Netty rather than creating your own scheduled-task thread pool, which adds load to the system and introduces potential concurrency-safety problems.

If a device loses power suddenly, the connection is blocked by a firewall, a GC pause lasts too long, or an unexpected exception occurs in the communication thread, the link becomes unavailable without being discovered in time. If this happens during the morning business peak, a large number of services may fail or time out because of the unavailable links, which is a serious threat to system reliability.

From a technical point of view, solving the link reliability problem requires checking the validity of links periodically. By far the most popular and common method is heartbeat detection, and a heartbeat detection mechanism can exist at three layers:

  1. TCP-layer heartbeat detection, i.e. the TCP keep-alive mechanism, whose scope is the entire TCP stack.
  2. Protocol-layer heartbeat detection, which mainly exists in long-connection protocols such as MQTT.
  3. Application-layer heartbeat detection, implemented by the business products themselves, which periodically send heartbeat messages to each other by convention.

The purpose of heartbeat detection is to check whether the current link is available and whether the peer is alive and can send and receive messages normally. As a highly reliable NIO framework, Netty also provides a heartbeat detection mechanism.

Common heartbeat detection policies are as follows.

  1. If no Pong reply (or Ping request) is received for N consecutive heartbeat periods, the link is considered to have failed logically; this is called a heartbeat timeout.
  2. If an I/O exception occurs while reading or sending a heartbeat message, the link has failed; this is called a heartbeat failure. Whether it is a heartbeat timeout or a heartbeat failure, the link must be closed and the client must initiate a reconnection so that the link can recover.

Netty provides three link idle-detection mechanisms, which make it easy to implement heartbeat detection:

  1. Read idle: no message is read on the link for duration T.
  2. Write idle: no message is sent on the link for duration T.
  3. Read/write idle: no message is received or sent on the link for duration T.

For servers at the million-connection scale, very long heartbeat periods and timeout values are generally not recommended.
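As a minimal sketch of the idle-detection approach described above (the class names and the 60-second reader-idle period are illustrative assumptions, not values prescribed here): Netty's IdleStateHandler fires an IdleStateEvent when no read, no write, or neither occurs within the configured period, and a custom handler can treat a reader-idle event as a heartbeat timeout and close the link.

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;

public class HeartbeatServerInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        // Reader-idle 60 s; writer-idle and all-idle disabled (0) -- example values only.
        ch.pipeline().addLast(new IdleStateHandler(60, 0, 0));
        ch.pipeline().addLast(new HeartbeatTimeoutHandler());
    }

    static class HeartbeatTimeoutHandler extends ChannelInboundHandlerAdapter {
        @Override
        public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
            if (evt instanceof IdleStateEvent
                    && ((IdleStateEvent) evt).state() == IdleState.READER_IDLE) {
                // No data (including heartbeats) read for 60 s: treat as heartbeat timeout.
                ctx.close();          // close the link; the client is expected to reconnect
            } else {
                super.userEventTriggered(ctx, evt);
            }
        }
    }
}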

3. Tune the receive and send buffers

In some scenarios, the device only reports data and sends heartbeat messages periodically, so the number of messages sent and received on a single link is small. In such scenarios, you can reduce the resource usage of a single TCP connection by shrinking its TCP receive and send buffers.

Of course, the optimal size of the send and receive buffers varies with the application scenario, so the values need to be adjusted based on the actual scenario and performance-test data.
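For example, the per-connection socket buffers can be shrunk through Netty's standard child options; the 8 KB values below are only illustrative starting points, to be validated against performance-test data.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;

public final class BufferTuning {
    // Apply smaller per-connection TCP buffers to a ServerBootstrap such as the one built earlier.
    static void shrinkSocketBuffers(ServerBootstrap b) {
        b.childOption(ChannelOption.SO_RCVBUF, 8 * 1024)   // receive buffer for each accepted channel
         .childOption(ChannelOption.SO_SNDBUF, 8 * 1024);  // send buffer for each accepted channel
    }
}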

4. Make proper use of memory pools

With the evolution of the JVM and JIT just-in-time compilation technology, object allocation and collection has become a very lightweight task. The situation is slightly different for buffers, however: allocating and reclaiming direct off-heap memory in particular is a relatively time-consuming operation.

To maximize buffer reuse, Netty provides a buffer reuse mechanism based on memory pools.

In the million-connection scenario, at least one receive buffer and one send buffer object must be allocated for each connected end-side device. In the traditional non-pooled mode, a ByteBuf object is created and released for every message read and write. With one million connections, each reporting data or a heartbeat once per second, ByteBuf objects are allocated and released one million times per second; even if the server's memory can keep up, the GC pressure is enormous.

The most effective solution to this problem is a memory pool. Each NioEventLoop thread handles N links, and within a thread the links are processed serially. Without pooling, when link A is processed, a receive buffer is created; after decoding, the resulting POJO is wrapped in a task and handed to the back-end thread pool, and the receive buffer is released. This create-and-release cycle repeats for every message received and processed. With a memory pool, when link A receives a new data report, it requests an idle ByteBuf from the NioEventLoop's memory pool; after decoding, release is called to return the ByteBuf to the pool so that link B and subsequent links can reuse it.

Netty's memory pool comes in two flavors: off-heap direct memory and heap memory. Because ByteBuf is mainly used for network I/O reads and writes, using off-heap direct memory saves one copy of the byte array from the user-space heap to kernel space and therefore performs better. Since DirectByteBuf is expensive to create, it should be used together with a memory pool; otherwise a heap ByteBuf may actually be more cost-effective than DirectByteBuf.

By default, Netty uses pooled direct off-heap memory for I/O operations. If you need to create ByteBuf yourself, the pooled mode is recommended. If no network I/O is involved (pure in-memory operations), a heap memory pool allocates memory more efficiently.
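A small sketch of how the pooled allocator is typically used (the method names here are just for illustration): the allocator is configured on the bootstrap, and inside a handler a pooled direct buffer is borrowed from the channel's allocator and returned to the pool once Netty has written it out.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelOption;

public final class PooledAllocatorExample {
    // Use the pooled allocator for all accepted child channels
    // (pooled direct memory is already the default in recent Netty versions).
    static void usePooledAllocator(ServerBootstrap b) {
        b.childOption(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT);
    }

    // Inside a handler: borrow a pooled direct buffer, fill it, and hand it to the pipeline;
    // Netty releases the buffer back to the pool after it has been written to the socket.
    static void writeWithPooledBuffer(ChannelHandlerContext ctx, byte[] payload) {
        ByteBuf buf = ctx.alloc().directBuffer(payload.length);
        buf.writeBytes(payload);
        ctx.writeAndFlush(buf);
    }
}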

5. Separate the I/O thread from the business thread

If the server does not perform complex business logic, only simple in-memory operations and message forwarding, the business ChannelHandler can be executed directly on the I/O thread (enlarging the NioEventLoop worker thread pool if necessary), which saves a thread context switch and improves performance.

If there are complex business logic operations, it is recommended to separate the I/O threads from the business threads. For the I/O threads, since there is no lock contention between them, a large NioEventLoopGroup can be created with all Channels sharing the same thread pool.

For the back-end business thread pools, it is recommended to create multiple small thread pools, each bound to a group of I/O threads, to reduce lock contention and improve back-end processing performance.
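One common way to do this in Netty is to pass a separate EventExecutorGroup when adding the business handlers to the pipeline, so that their callbacks run on the business pool instead of the NioEventLoop I/O thread. The pool size of 16 and the no-op handler below are placeholder assumptions for the sketch.

import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.util.concurrent.DefaultEventExecutorGroup;
import io.netty.util.concurrent.EventExecutorGroup;

public class SeparatedThreadsInitializer extends ChannelInitializer<SocketChannel> {
    // A small, dedicated pool for business handlers; 16 threads is only an example size.
    private static final EventExecutorGroup businessGroup = new DefaultEventExecutorGroup(16);

    @Override
    protected void initChannel(SocketChannel ch) {
        // Handlers added without a group run on the channel's I/O (NioEventLoop) thread;
        // passing businessGroup moves this handler's callbacks onto the business thread pool.
        ch.pipeline().addLast(businessGroup, "businessHandler", new ChannelInboundHandlerAdapter());
    }
}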

Flow control on the number of concurrent end-side connections

No matter how much the server side is optimized, flow control still has to be considered. When resources become a bottleneck, or when massive numbers of end-side devices connect at once, flow control is needed to protect the system. There are many flow-control strategies; one example is flow control on the number of end-side connections:

In Netty, this flow-control function is easy to implement: add a new FlowControlChannelHandler so that, after the TCP link is created, the flow-control logic runs first; if the flow-control threshold has been reached, the connection is rejected and the ChannelHandlerContext's close() method is called to close the link.
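The FlowControlChannelHandler itself is not spelled out in this article; the following is one possible minimal sketch, assuming a simple shared connection counter and an illustrative threshold of one million connections.

import java.util.concurrent.atomic.AtomicLong;

import io.netty.channel.ChannelHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

@ChannelHandler.Sharable
public class FlowControlChannelHandler extends ChannelInboundHandlerAdapter {
    private static final long MAX_CONNECTIONS = 1_000_000; // illustrative threshold
    private static final AtomicLong current = new AtomicLong();

    @Override
    public void channelActive(ChannelHandlerContext ctx) throws Exception {
        if (current.incrementAndGet() > MAX_CONNECTIONS) {
            ctx.close();              // over the threshold: reject the new connection
            return;                   // channelInactive below will release the slot
        }
        super.channelActive(ctx);     // below the threshold: let the pipeline continue
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        current.decrementAndGet();    // release the slot when the link closes
        super.channelInactive(ctx);
    }
}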

Third, JVM-level performance optimization

When the number of simultaneous client connections reaches hundreds of thousands or millions, even a small jitter in the system can have serious consequences. For example, a server GC that pauses the application (STW) for a few seconds can cause massive numbers of client devices to disconnect or messages to pile up; once the system recovers, the flood of reconnecting devices and pending data may then overwhelm the server.

Tuning at the JVM level mainly means optimizing GC parameters. Improper GC parameters lead to frequent GC and even OOM exceptions, which seriously affect the stable operation of the server.

1. Determine GC optimization objectives

GC (garbage collection) has three main metrics.

  1. Throughput: an important indicator of GC capability; it is the highest application performance the GC can sustain, without regard to the pause time or memory consumption caused by GC.
  2. Latency: an important indicator of GC capability; it is the pause time caused by GC, and the optimization goal is to shorten or completely eliminate pauses (STW) so that the application does not jitter while running.
  3. Memory footprint: the amount of memory occupied when GC runs normally.

The three basic principles for JVM GC tuning are as follows.

  1. Minor GC collection principle: each young-generation GC should reclaim as much memory as possible to reduce the frequency of Full GC in the application.
  2. GC memory maximization principle: the more memory the garbage collector can use, the more efficient collection is and the smoother the application runs. However, a Full GC on a very large heap can take a long time; if Full GC can be effectively avoided, careful fine-tuning is required.
  3. "Pick two of three" principle: throughput, latency, and memory footprint cannot all be optimized at the same time, so choose according to the business scenario. For most applications throughput takes precedence over latency; for latency-sensitive services the order is reversed.

2. Determine the server memory usage

Before tuning GC, you need to determine the memory footprint of the application so that you can give it an appropriately sized heap and improve GC efficiency. The memory footprint is related to the active data, i.e. the long-lived Java objects present while the application runs in a steady state. Active data is measured by collecting GC information from the GC log: take the size of the Java heap occupied by the old generation and by the permanent generation (metaspace) while the application is stable; their sum is the memory occupied by the active data.

3. GC optimization process

  1. Collect and read GC data
  2. Set the appropriate JVM heap size
  3. Choose the appropriate garbage collector and collection strategy

GC tuning is usually an iterative process of multiple rounds, involving not only parameter adjustments but, more importantly, changes to the business code.