We’ll start today by talking about Netty, which we all know is a high-performance asynchronous event-driven network framework.

Its design is exceptionally elegant and simple, with high scalability and strong stability, and it comes with very detailed and complete user documentation.

At the same time, it has many very useful modules built in that basically work out of the box. Users only need to write a few lines of code to quickly build a highly concurrent network application with high throughput, low latency, low resource consumption, and high performance (unnecessary memory copies are minimized).

In this article, we discuss the foundation that supports Netty's high-throughput, low-latency characteristics: Netty's network IO model.

Starting with Netty’s network IO model, we will officially open this series of Netty source code parsing prelude:

Network packet receiving process

  • When a network data frame arrives at the NIC over the network, the NIC transfers the frame into the RingBuffer via DMA.

The RingBuffer is the ring buffer queue allocated and initialized when the NIC starts up. When the RingBuffer is full, incoming packets are discarded. You can run the ifconfig command to check the NIC's send and receive statistics; the overruns field indicates the packets discarded because the RingBuffer was full. If such packet loss occurs, you can use the ethtool command to increase the length of the RingBuffer.

  • When the DMA operation completes, the NIC raises a hard interrupt to the CPU to signal that network data has arrived. The CPU invokes the hard interrupt handler registered by the NIC driver. The hard interrupt handler creates the kernel data structure sk_buffer for the network data frame and copies the frame into the sk_buffer. It then raises a soft interrupt request to notify the kernel that a new network data frame has arrived.

sk_buffer maintains network frames in a doubly linked list; each element in the list is a network frame. Although the TCP/IP stack has several layers, passing a frame between the layers only requires manipulating pointers in this data structure, without copying the data.

  • The kernel thread ksoftirqd detects the soft interrupt request and calls the poll function registered by the NIC driver. The poll function sends the network packets in the sk_buffer to the ip_rcv function registered in the kernel protocol stack.

Each CPU is bound to a ksoftirqd kernel thread dedicated to handling soft interrupts. With two CPUs, there are two such kernel threads: ksoftirqd/0 and ksoftirqd/1.

One thing to note here: when the NIC receives data and the DMA copy completes, it issues a hard interrupt to a CPU, and the soft interrupt request raised in the NIC's hard interrupt handler is also handled by the ksoftirqd thread bound to that same CPU. Therefore, if you find that soft interrupt CPU consumption is concentrated on one core, you need to adjust the CPU affinity of the hard interrupts to spread them across different CPU cores.

  • In the ip_rcv function, which corresponds to the network layer in the figure above, the kernel extracts the IP header of the packet and determines the next hop. If the packet is destined for the local machine, it looks up the transport-layer protocol type (TCP or UDP), strips off the IP header, and hands the packet to the transport layer shown in the figure above for processing.

The transport-layer processing functions are: for TCP, tcp_rcv, registered in the kernel protocol stack; for UDP, udp_rcv, registered in the kernel protocol stack.

  • When we use the TCP protocol, a packet arriving at the transport layer is processed by the tcp_rcv function in the kernel protocol stack. tcp_rcv removes the TCP header and looks up the corresponding Socket according to the four-tuple (source IP, source port, destination IP, destination port). If the corresponding Socket is found, the payload of the network packet is copied into the Socket's receive buffer. If not, an ICMP destination-unreachable packet is sent.

  • That covers how the kernel receives network packets. Now let's shift our view to the application layer. When our program reads the data in the Socket receive buffer through the read system call, if there is no data in the receive buffer, the application blocks in the system call until data arrives in the Socket receive buffer. The CPU then copies the data from kernel space (the Socket receive buffer) to user space, the read system call returns, and the application reads the data.
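From the application's point of view, this entire receive path hides behind a single read call. The following is only a minimal user-space sketch with a hypothetical function name (read_blocking), showing where the blocking described above happens:

#include <stdio.h>
#include <unistd.h>

// Blocking read: if the Socket receive buffer is empty, read() puts the
// calling thread to sleep until data arrives and is copied to user space.
ssize_t read_blocking(int connfd, char *buf, size_t len)
{
    ssize_t n = read(connfd, buf, len);   // may block here
    if (n < 0)
        perror("read");
    return n;   // bytes copied out of the kernel Socket receive buffer
}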

Performance overhead

The kernel does a lot of work for us in terms of how it handles network packet reception so that our application can read the network data.

But this also brings a lot of performance overhead. Combining it with the receive process introduced above, let's look at the performance overhead involved in receiving a network packet:

  • The overhead of the application switching from user mode to kernel mode through the system call, and from kernel mode back to user mode when the system call returns.
  • The overhead of the CPU copying network data from kernel space to user space.
  • The overhead of the kernel thread ksoftirqd responding to soft interrupts.
  • The overhead of the CPU responding to hard interrupts.
  • The overhead of DMA copying network packets into memory.

Network packet sending process

  • When we call the send system call in the application to send data, the thread switches from user mode to kernel mode. In the kernel, the real Socket is first found according to the fd; it records the function addresses of the various protocol stack operations. Then a struct msghdr object is constructed, and all the data the user wants to send is encapsulated in this struct msghdr structure.

  • The kernel protocol stack function inet_sendmsg is called and the sending process enters the kernel protocol stack for processing. After entering the kernel protocol stack, the kernel finds the protocol-specific sending function on the Socket.

For example, if we use TCP, the corresponding TCP sending function is tcp_sendmsg, and if we use UDP, the corresponding sending function is udp_sendmsg.
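On the user-space side, the data enters this path through the send or sendmsg system call. The snippet below is only a user-space sketch (hypothetical function name send_buffer, error handling omitted); with plain send() the kernel builds the struct msghdr internally, while sendmsg lets us construct it ourselves:

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

// Hand a user buffer to the kernel; tcp_sendmsg will eventually copy it
// into an sk_buffer and append it to the Socket send queue.
ssize_t send_buffer(int connfd, const void *data, size_t len)
{
    struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov    = &iov;
    msg.msg_iovlen = 1;

    // A successful return only means the data was copied into the kernel's
    // Socket send queue, not that it has already gone out on the wire.
    return sendmsg(connfd, &msg, 0);
}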

  • In the TCP sending function tcp_sendmsg, the kernel creates the kernel data structure sk_buffer and copies the data to be sent in the struct msghdr structure into the sk_buffer. It then calls tcp_write_queue_tail to get the tail element of the Socket send queue and appends the newly created sk_buffer to the tail of the Socket send queue.

The Socket send queue is a doubly linked list composed of sk_buffers.

At this point in the sending process, the data the user wants to send has been copied from user space into the Socket send queue in the kernel. However, this does not mean the kernel will start sending right away: because of TCP flow control and congestion control, the packets the user wants to send will not necessarily be sent out immediately; the TCP transmission conditions must be met first. If the send condition is not met, the send system call simply returns.

  • If the send criteria are met, the tcp_write_xmit kernel function is called. In this function, the sk_buffer in the Socket send queue is iterated, and congestion control and sliding window management are performed.

  • Copy the sk_buffer from the Socket send queue and set the TCP header in the sk_buffer copy.

The sk_buffer contains space for all of the network protocol headers. When setting the TCP header, you simply point the pointer to the appropriate position in the sk_buffer; when setting the IP header later, you just move the pointer again. This avoids frequent memory allocation and copying.
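The snippet below is only an illustrative sketch, not the actual kernel transmit code: skb_push is the kernel helper that moves the sk_buff data pointer backwards to make room for a header, so filling in a header is pure pointer arithmetic rather than a copy of the payload.

#include <linux/skbuff.h>
#include <linux/tcp.h>

/* Illustrative sketch only: prepend a TCP header to an sk_buff.
 * skb_push() moves skb->data back by the given length and returns the new
 * start of the data area; no payload bytes are copied. */
static void push_tcp_header(struct sk_buff *skb, __be16 sport, __be16 dport)
{
        struct tcphdr *th = (struct tcphdr *)skb_push(skb, sizeof(struct tcphdr));

        th->source = sport;   /* header fields are written in place */
        th->dest   = dport;
}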

Why can't the kernel just use the sk_buffer already in the Socket send queue? Why does it need to make a copy?

Because TCP supports packet loss and retransmission, the sk_buffer cannot be deleted until an ACK is received from the peer. Each time the kernel asks the NIC to send data, it actually passes a copy of the sk_buffer. When the NIC finishes sending, that copy is freed. The sk_buffer in the Socket send queue is deleted only after the peer's ACK is received.

  • Once the TCP header is set, the transport layer of the kernel protocol stack is done. The next step is to call the ip_queue_xmit kernel function to process the network layer of the kernel protocol stack.

    • Check whether the Socket has a cached routing table. If not, look up the routing entry and cache it in the Socket, then set the routing table into the sk_buffer.

    You can run the route command to view the local route configuration.

    • Move the pointer in sk_buffer to the IP header position and set the IP header.

    • Perform netfilters filtering. After filtering, fragments are performed if the data is larger than the MTU.

    If you have configured some rules using iptables, this step checks whether those rules are hit. If you set up very complex netfilter rules, this function will significantly increase your thread's CPU overhead.

  • The neighbor subsystem, located between the network layer and the network interface layer of the kernel protocol stack, sends ARP requests for MAC addresses and then moves the pointer in sk_buffer to fill the MAC header.

  • After processing by the neighbor subsystem, a complete data frame is now encapsulated in SK_buffer, which is then handed over by the kernel to the network device subsystem for processing. The network equipment subsystem mainly does the following things:

    • Select a send queue (RingBuffer). Because the NIC has multiple send queues, a send queue must be selected before sending packets.
    • Add the sk_buffer to the send queue.
    • Loop over the send queue (RingBuffer), retrieve sk_buffers, and call the kernel function sch_direct_xmit to send the data, which in turn calls the NIC driver to send the data.

When the CPU quota allocated to the user thread is used up, a soft interrupt of type NET_TX_SOFTIRQ is triggered. The kernel thread ksoftirqd responds to this soft interrupt and executes net_tx_action, the callback function registered for soft interrupts of type NET_TX_SOFTIRQ, in which the driver function dev_hard_start_xmit is executed to send the data.

Note: when data is sent via the NET_TX_SOFTIRQ soft interrupt, the CPU time consumed afterward is shown as si (soft interrupt time) and is not charged to the user process's system time (sy).

From this we can see that the send and receive paths differ. When introducing the receive process, we mentioned that the kernel protocol stack receives data in the kernel thread ksoftirqd by triggering a soft interrupt of type NET_RX_SOFTIRQ. On the send path, a NET_TX_SOFTIRQ soft interrupt is triggered only when the thread's CPU quota is exhausted.

Soft interrupts of type NET_TX_SOFTIRQ are triggered only when network packets are sent and when the user thread’s CPU quota is exhausted during the entire network packet sending and receiving process. The remaining soft interrupt types triggered during receiving and after sending data are both NET_RX_SOFTIRQ.

So that’s why if you look at /proc/softirqs on the server, NET_RX is usually much larger than NET_TX.

  • Now the send process has finally arrived at the stage where the NIC actually sends the data. Whether it is the user thread running in kernel mode or the NET_TX_SOFTIRQ soft interrupt triggered while sending data, both eventually call the NIC driver function dev_hard_start_xmit to send the data. In dev_hard_start_xmit, the sk_buffer is mapped into a DMA memory region that the NIC can access, and the NIC driver finally sends the data frame out through the physical NIC via DMA.

  • When the data has been sent, there is one last important task: cleaning up. After the data is sent, the NIC device raises a hard interrupt to the CPU. The CPU invokes the hard interrupt handler registered by the NIC driver, which triggers a NET_RX_SOFTIRQ soft interrupt. In the soft interrupt's igb_poll callback function, the sk_buffer is cleaned up and released, the NIC send queue (RingBuffer) is cleared, and the DMA mapping is removed.

Soft interrupts that are triggered from hard interrupts are NET_RX_SOFTIRQ, either because there is data to receive or because a notification of completion is sent.

Note that only the copy of the sk_buffer is released and cleaned up here. The real sk_buffer still sits in the Socket's send queue. As mentioned earlier in the transport-layer processing, the sk_buffer has not been removed because the transport layer has to guarantee reliability; it will not be deleted until an ACK is received from the peer.

Performance overhead

Before, we mentioned the performance cost involved in the process of receiving network packets. Now that we have introduced the process of sending network packets, let’s look at the performance cost in the process of sending network packets:

  • The cost of switching from user to kernel when the system call SEND is called and from kernel to user when the system call returns is the same as the cost of receiving data.

  • The cost of the kernel responding to the NET_TX_SOFTIRQ soft interrupt that is triggered when the user thread's kernel-mode CPU quota is exhausted.

  • After the NIC sends the data, it raises a hard interrupt to the CPU; the CPU responds to the hard interrupt, and a NET_RX_SOFTIRQ soft interrupt is raised from the hard interrupt handler to perform the memory cleanup work. The cost of the kernel responding to these interrupts.

  • Memory copy cost. Let’s review what memory copies take place during packet delivery:

    • In the transport layer of the kernel protocol stack, the TCP sending function tcp_sendmsg allocates an sk_buffer and copies the data the user wants to send into the sk_buffer.
    • When the sending process goes from the transport layer to the network layer, a copy of the sk_buffer is made and that copy is passed down. The original sk_buffer stays in the Socket send queue, waiting for the network peer's ACK; after the ACK arrives, the sk_buffer is deleted from the Socket send queue. If the peer does not send an ACK, it is retransmitted from the Socket send queue, implementing TCP's reliable transmission.
    • At the network layer, if the data to be sent is larger than the MTU, fragmentation is performed: additional sk_buffers are allocated and the original sk_buffer is copied into multiple small sk_buffers.

Talking more about blocking vs. non-blocking and synchronous vs. asynchronous

After we talk about the process of receiving and sending network data, let’s talk about the particularly confusing concepts of IO: blocking versus synchronous, non-blocking versus asynchronous.

There are all kinds of posts online and plenty of books explaining these two pairs of concepts, but I feel they are not intuitive enough and only state the definitions rigidly. If you just force-fit the definitions, blocking and synchronous, non-blocking and asynchronous still do not seem all that different, and as time goes by the confusion comes back.

So the author tries to explain clearly what is blocking and non-blocking and what is synchronous and asynchronous in a more visual and memorizable way.

After the introduction of network packet receiving process, we can summarize the whole process into two stages:

  • Data preparation stage: in this stage, the network packet arrives at the NIC, is transferred into memory via DMA, then goes through hard interrupts and soft interrupts, and finally, after the kernel thread ksoftirqd has processed it through the kernel protocol stack, the data is placed into the kernel Socket receive buffer.

  • Data copy stage: when the data arrives in the kernel Socket receive buffer, it still lives in kernel space and needs to be copied to user space before the application can read it.

Blocking and non-blocking

The difference between blocking and non-blocking occurs mainly in the first phase: the data preparation phase.

When the application makes the system call read, the thread goes from user state to kernel state and reads the network data in the receive buffer of the kernel Socket.

blocking

If there is no data in the kernel Socket’s receive buffer, the thread will wait until there is data in the Socket’s receive buffer. The data is then copied from kernel space to user space, and the system call read returns.

As we can see from the figure, blocking is characterized by waiting in both the first and second phases.

non-blocking

The main distinction between blocking and non-blocking is in the first phase: the data preparation phase.

  • In the first phase, the application thread waits in blocking mode when there is no data in the Socket’s receive buffer. In non-blocking mode the application thread does not wait and the system call returns the error flag EWOULDBLOCK.

  • When there is data in the Socket’s receive buffer, blocking and non-blocking behave the same, both entering phase 2 and waiting for the data to be copied from kernel space to user space before the system call returns.

From the figure above, we can see that the non-blocking characteristic is that the first phase does not wait, but the second phase still waits.

Synchronous and asynchronous

The main difference between synchronous and asynchronous occurs in the second stage: the data copy stage.

Earlier, we mentioned that the data copy phase mainly involves copying data from kernel space to user space. Then the application can read the data.

Phase 2 starts when data arrives in the kernel Socket’s receive buffer.

synchronous

In synchronous mode, the second stage is performed by the kernel state of the user thread after the data is ready. So the application blocks in the second phase, and the system call does not return until the data is copied from kernel space to user space.

Both epoll on Linux and kqueue on Mac belong to synchronous IO.

asynchronous

In asynchronous mode, the kernel performs the phase 2 data copy operation. When the kernel completes the phase 2 data copy operation, the kernel notifies the user thread that the I/O operation is complete and calls back the data to the user thread. Therefore, in asynchronous mode, the data preparation and data copy phases are done by the kernel without causing any blocking to the application.

Based on the above characteristics, we can see that asynchronous mode requires kernel support and is more dependent on the underlying support of the operating system.

In the current popular operating systems, only the IOCP in Windows really belongs to asynchronous IO, and the implementation is very mature. But Windows is rarely used as a server.

Linux, which is often used as a server, has a less mature implementation of asynchronous IO and a less significant performance improvement than NIO.

However, the new asynchronous IO library io_uring, contributed by Facebook's Jens Axboe in kernel version 5.1, ameliorates some of the performance problems of the original Linux native AIO. Its performance improvement over epoll and the earlier native AIO is noteworthy.

IO model

The IO performance of network framework is largely determined by what IO model is used to read and write data during network IO operation. So the choice of IO model is the basis of building a high-performance network framework.

In the book UNIX Network Programming, five IO models are introduced: blocking IO, non-blocking IO, IO multiplexing, signal-driven IO, and asynchronous IO. Each IO model is an upgrade of the previous one.

Here we will introduce the problems solved by these five IO models, which scenarios are suitable for them, and what are their advantages and disadvantages.

Blocking IO (BIO)

After the introduction to the concept of blocking in the previous section, I believe that you can easily understand the concept and process of blocking IO.

Since we are talking about IO in this section, let’s look at the process of reading and writing network data under the blocking IO model.

Block read

When a user thread makes a read system call, the user thread switches from user state to kernel state, where it checks the Socket receive buffer for incoming data.

  • If there is data in the Socket receive buffer, the user thread copies the data from the kernel space to the user space in the kernel state, and the system IO call returns.

  • If there is no data in the Socket receiving buffer, the user thread relinquishes the CPU and enters the blocking state. When the data reaches the Socket receive buffer, the kernel wakes up the user thread in the blocked state to enter the ready state, then gets the CPU quota through the CPU’s schedule to enter the running state, copies the data from the kernel space to the user space, and the system call returns.

Block write

When a user thread initiates a send system call, the user thread switches from the user state to the kernel state and copies the send data from user space to the Socket send buffer in the kernel space.

  • When the Socket send buffer is large enough to hold the sent data, the user thread writes all the sent data to the Socket buffer, performs the subsequent flow described in the network Packet Send Flow section, and returns.

  • When the Socket send buffer space is insufficient to hold all the sent data, the user thread relinquishes the CPU and enters the blocking state. Until the Socket send buffer space is sufficient to hold all the sent data, the kernel wakes up the user thread to execute the subsequent send process.

Write operations in the blocking I/O model tend to be rigid, requiring that all sent data be written to the send buffer.

Blocking IO model

Due to the read-write characteristics of blocking I/O, each request needs to be processed by an independent thread in the blocking I/O model. A thread can only be bound to one connection at a time. To receive a request, the server needs to create a thread to handle the request.
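A minimal sketch of this thread-per-connection pattern (hypothetical names, error handling omitted): one thread blocks in accept, and every accepted connection gets its own thread that blocks in read.

#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical per-connection handler: blocking read/write on one socket.
static void *handle_conn(void *arg)
{
    int connfd = (int)(long)arg;
    char buf[4096];
    ssize_t n;

    while ((n = read(connfd, buf, sizeof(buf))) > 0)   // blocks while idle
        write(connfd, buf, n);                         // echo back

    close(connfd);
    return NULL;
}

// Blocking IO model: one dedicated thread per accepted connection.
static void serve_blocking(int listenfd)
{
    for (;;) {
        int connfd = accept(listenfd, NULL, NULL);     // blocks for new connections
        pthread_t tid;
        pthread_create(&tid, NULL, handle_conn, (void *)(long)connfd);
        pthread_detach(tid);
    }
}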

When the number of concurrent client requests surges, the server will create a large number of threads in an instant, and thread creation has a system resource cost, so a great deal of system resources are consumed all at once.

If a client creates a connection but does not send data, and in most cases a network connection is not always readable, then during this idle time the server thread stays blocked and cannot do anything else. The CPU is underutilized, and there is also a lot of thread-switching overhead.

Applicable scenario

Based on the characteristics of the blocking I/O model, this model is only applicable to service scenarios with a small number of connections and low concurrency.

For example, in some management systems within the company, the number of requests is usually around 100, and the blocking IO model is very suitable. And the performance is as good as NIO.

Before C10K, this model was a widely used IO model.

Non-blocking IO (NIO)

The biggest problem with the blocking I/O model is that a thread can only handle one connection, and if there is no data on that connection, that thread can only block system I/O calls and nothing else. This is a huge waste of system resources. At the same time, a large number of thread context switches are also a huge system overhead.

To solve this problem, we need to handle as many connections as possible with as few threads. The evolution of network IO model is also based on this requirement step by step.

Based on this requirement, the first solution, non-blocking IO, emerged. We introduced the concept of non-blocking in the previous section. Now let’s look at the characteristics of network read/write operations under non-blocking IO:

A non-blocking read

When a user thread makes a non-blocking read system call, the user thread moves from user state to kernel state, where it checks the Socket receive buffer for incoming data.

  • If there is no data in the Socket receive buffer, the system call returns immediately with an EWOULDBLOCK or EAGAIN error. In this case the user thread neither blocks nor yields the CPU; it just keeps polling until data arrives in the Socket receive buffer.

  • If there is data in the Socket receive buffer, the user thread, in kernel mode, copies the data from kernel space to user space. Note that the application is blocked during this data copy phase, and the system call returns once the copy is complete.
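A minimal sketch of the behavior just described (hypothetical function names): the socket is first switched to non-blocking mode with fcntl, after which read() returns immediately with EWOULDBLOCK/EAGAIN instead of blocking when the receive buffer is empty.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

// Put the socket into non-blocking mode.
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

// Non-blocking read: never sleeps waiting for data, but the copy from
// kernel space to user space still happens synchronously once data exists.
static ssize_t try_read(int connfd, char *buf, size_t len)
{
    ssize_t n = read(connfd, buf, len);
    if (n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN)) {
        // Receive buffer is empty: the call did not block, it just came
        // back empty-handed; the caller is expected to poll again later.
    }
    return n;   // >0: bytes copied, 0: peer closed, <0: error or try again
}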

Non-blocking write

When we introduced blocking writes, we mentioned that blocking writes have a particularly stubborn style: they insist on writing all of the data into the Socket send buffer in one go before returning, and if there is not enough space in the send buffer they just block there and wait, stubbornly.

Non-blocking writes, by comparison, are far more easygoing. When there is not enough space in the send buffer to hold all the data, a non-blocking write writes as much as it can and then returns immediately, reporting the number of bytes written to the send buffer back to the application, so that the user thread can keep retrying to write the remaining data into the send buffer.
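A sketch of that write-what-fits-then-retry behavior on a non-blocking socket (hypothetical function name; a real program would wait for a writable event instead of spinning):

#include <errno.h>
#include <unistd.h>

// Non-blocking write: write() copies as much as currently fits into the
// Socket send buffer and returns that count; when the buffer is full it
// returns -1 with EWOULDBLOCK/EAGAIN and we simply try again.
static int write_all_nonblocking(int connfd, const char *buf, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
        ssize_t n = write(connfd, buf + sent, len - sent);
        if (n > 0)
            sent += (size_t)n;          // partial write is fine, keep going
        else if (n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
            continue;                   // send buffer full: retry later
        else
            return -1;                  // real error
    }
    return 0;
}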

Non-blocking IO model

Due to the above non-blocking IO characteristics, we do not need to allocate a thread per request to handle reads and writes on the connection as we do with blocking IO.

We can use one thread, or a small number of threads, to keep polling every Socket's receive buffer to see whether data has arrived. If a Socket has no data, there is no need to block the thread; it simply moves on and polls the next Socket's receive buffer. When it polls a Socket that has data, it handles the reads and writes on that connection, or hands the business processing off to a thread pool, and the polling thread then continues polling the other Socket receive buffers.

Such a non-blocking IO model fulfills the requirement we raised at the beginning of this section: we need to process as many connections as possible with as few threads as possible.
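A sketch of this single-threaded polling model over a set of sockets that have already been put into non-blocking mode (hypothetical fds array and handle_data helper):

#include <errno.h>
#include <unistd.h>

// Hypothetical handler for data read from one connection.
void handle_data(int fd, const char *buf, ssize_t n);

// Non-blocking IO model: one thread keeps sweeping over all connections,
// issuing one non-blocking read system call per socket on every pass.
static void busy_poll_loop(int *fds, int nfds)
{
    char buf[4096];

    for (;;) {
        for (int i = 0; i < nfds; i++) {
            ssize_t n = read(fds[i], buf, sizeof(buf));   // sockets are O_NONBLOCK
            if (n > 0)
                handle_data(fds[i], buf, n);   // or hand off to a thread pool
            else if (n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
                continue;                      // no data here, try the next socket
        }
    }
}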

Applicable scenario

Compared with the blocking IO model, the non-blocking IO model does reduce a large part of the resource consumption and system overhead.

But it still has a big performance problem: under the non-blocking IO model, the user thread has to keep issuing system calls to poll the Socket receive buffers, which requires the user thread to constantly switch from user mode to kernel mode and back again. As the amount of concurrency increases, the overhead of this context switching becomes significant.

Therefore, the pure non-blocking IO model is still not suitable for high-concurrency scenarios. It is only applicable to scenarios at C10K or below.

IO multiplexing

At the beginning of the section on non-blocking IO, we mentioned that the evolution of the network IO model has been centered around the core requirement of how to handle as many connections as possible with as few threads.

In this section we will talk about the IO multiplexing model. What is multiplexing? What is reuse?

Let’s start with this core requirement:

  • Multiplexing: Our core requirement is to process as many connections as possible with as few threads as possible, and by multiplexing we mean the number of connections we need to process.

  • Reuse: The core requirement is to process as many connections (multiplexing) as possible with as few threads and as little overhead as possible, and by reuse I mean using limited resources, such as a single thread or a fixed number of threads to process read and write events on many connections. In other words, in the blocking IO model, a connection needs to be allocated a separate thread to handle reads and writes on that connection. In the IO multiplexing model, multiple connections can reuse a single thread to handle reads and writes on multiple connections.

Ok, the concept of the IO multiplexing model is explained, so the key question is how do we implement this multiplexing, that is, how do we have a single thread handling read and write events on many connections?

The answer to this question is given in the non-blocking IO model, in which non-blocking system IO calls are used to continuously poll the Socket receiving buffer of many connections to see if there is data coming, process it if there is data, and continue polling the next Socket if there is no data. This allows one thread to handle read and write events on many connections.

However, the biggest problem of the non-blocking IO model is that it needs to continuously launch system calls to poll each Socket for incoming data in the receiving buffer. Frequent system calls bring a lot of context switching overhead. This can also lead to serious performance issues as concurrency increases.

So how do we avoid frequent system calls while still implementing our core requirements?

This requires the operating system kernel to support such operations. We can delegate frequent polling operations to the operating system kernel to do it for us, thus avoiding the performance overhead of frequent polling using system calls in user space.

As we can imagine, the operating system kernel does provide this functionality. Let’s take a look at the implementation of the IO multiplexing model for the operating system.

select

Select is a system call provided by the operating system kernel. It solves the system overhead of switching between user space and kernel space in the non-blocking IO model, which needs to constantly initiate system IO calls to poll the Socket receiving buffer on each connection.

The SELECT system call hands off polling to the kernel to help us do so, avoiding the system performance overhead of constantly polling in user space.

  • First, the user thread blocks on the SELECT system call when it initiates the call. At this point, the user thread switches from user to kernel mode, completing a context switch

  • The user thread passes the fd array of file descriptors corresponding to the Socket it needs to listen on to the kernel via the SELECT system call. At this point, the user thread copies the file descriptor FD array from user space into the kernel space.

The array of file descriptors is actually a BitMap. The BitMap subscript is fd. The value of the subscript is 1 indicating that there are read/write events on the FD and 0 indicating that there are no read/write events on the FD.

A file descriptor fd is really just an integer value. In Linux everything is a file, and a Socket is no exception. Each process's task_struct has a files_struct *files member that ultimately points to an array holding the list of all files the process has opened; the file information is wrapped in a struct file structure, the elements of the array are struct file structures, and the array index is the file descriptor fd.

  • After the user thread has called select, it enters the blocked state. The kernel starts traversing the fd array, checking whether data has arrived in the Socket receive buffer corresponding to each fd. If data has arrived, the fd's value in the BitMap is set to 1; if no data has arrived, the value stays 0.

Note that the kernel modifies the original fd array!!

  • After iterating through the fd array, the kernel returns the modified FD array to the user thread if it finds IO data on any of the FDS. At this point, the FD array is copied from kernel space to user space.

  • When the kernel returns the modified fd array to the user thread, the user thread unblocks and starts traversing the fd array to find the Socket file descriptor with the value 1 in the FD array. Finally, system calls are made to these sockets to read the data.

Select does not tell the user thread which FDS are receiving I/O data. Instead, it marks the active FDS and returns the full array of FDS marked, so the user thread needs to traverse the array to find out which FDS are receiving I/O data.

  • Because the kernel modified the fd array during traversal, after the user thread has traversed the fd array and found the IO-ready Sockets, it needs to reset the fd array and call select again with the reset fd array, so that the kernel can start a new round of traversal and polling.

The select API

Once we are familiar with the principles of Select, it is easy to understand the SELECT API provided by the kernel.

int select(int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, const struct timeval *timeout)

As you can see from the SELECT API, the SELECT system call listens (polls) for readable, writable, and exception events on the set of file descriptors of interest to the user for a specified timeout period.

  • maxfdp1: the value of the largest file descriptor in the sets passed to the kernel, plus 1. It is used to limit the range of the kernel's traversal. For example, if the set of file descriptors select listens on is {0,1,2,3,4}, then maxfdp1 is 5.

  • fd_set *readset: the set of file descriptors we are interested in for readable events.

  • fd_set *writeset: the set of file descriptors we are interested in for writable events.

  • fd_set *exceptset: the set of file descriptors we are interested in for exception events.

The fd_set is the file descriptor array we mentioned earlier, which is a BitMap structure.

  • const struct timeval *timeout: the timeout of the select system call; if the kernel finds no IO-ready file descriptors within this time, it returns directly.

As mentioned in the previous section, after the kernel traverses the fd array and finds an IO-ready FD, it sets the value of the BitMap corresponding to the FD to 1 and returns the modified FD array to the user thread.

In the user thread, you need to iterate through the FD array to find the IO-ready FD, and then make the actual read/write call.

Here are the APIs we need for iterating over the fd set in the user thread:

  • void FD_ZERO(fd_set *fdset): clears the specified file descriptor set, so that fd_set contains no file descriptors.

  • void FD_SET(int fd, fd_set *fdset): adds the given file descriptor to the set.

The file descriptor set is reset with FD_ZERO and FD_SET before each call to select, because the set is modified in the kernel.

  • int FD_ISSET(int fd, fd_set *fdset): checks whether the specified file descriptor in the set is readable or writable. The user thread iterates over the file descriptor set and calls this method to check whether the corresponding file descriptor is IO ready.

  • void FD_CLR(int fd, fd_set *fdset): removes the given file descriptor from the set.
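Putting these calls together, here is a simplified sketch of a typical select loop over a set of connected sockets (hypothetical fds array and handle_readable helper). Note how the fd_set has to be rebuilt before every call because the kernel overwrites it:

#include <sys/select.h>

// Hypothetical handler for a readable connection.
void handle_readable(int fd);

static void select_loop(int *fds, int nfds)
{
    fd_set readset;

    for (;;) {
        // The kernel modifies the set in place, so rebuild it every round.
        FD_ZERO(&readset);
        int maxfd = -1;
        for (int i = 0; i < nfds; i++) {
            FD_SET(fds[i], &readset);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }

        // Block until at least one fd is IO ready (no timeout here).
        int ready = select(maxfd + 1, &readset, NULL, NULL, NULL);
        if (ready <= 0)
            continue;

        // select only marks the ready fds; we still traverse the whole set.
        for (int i = 0; i < nfds; i++)
            if (FD_ISSET(fds[i], &readset))
                handle_readable(fds[i]);
    }
}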

Performance overhead

Although SELECT solved the problem of frequent system calls in the non-blocking IO model, we saw some shortcomings in select throughout the process.

  • The user thread incurs a user-to-kernel and kernel-to-user context switch overhead when making and returning the SELECT system call. Two context switches occurred

  • The user thread in kernel mode needs to copy the set of file descriptors from user space to kernel space when making and returning the SELECT system call. And copying the file descriptor collection from kernel space to user space after the kernel has modified it. Two copies of the file descriptor collection occurred

  • Moving the polling from user space into kernel space is an optimization, but select does not tell the user thread which Sockets are actually IO ready. It only marks the IO-ready Sockets, and the user thread still has to traverse the whole set of file descriptors to find the specific IO-ready Sockets, so the time complexity is still O(n).

In most cases, network connections are not always active. If select listens for a large number of client connections, only a few connections are active. However, polling becomes less and less efficient as the number of connections increases.

  • The kernel modifies the original set of file descriptors. This results in the need to reset the file descriptor collection each time a SELECT call is re-initiated in user space.

  • BitMap file descriptor set, the length is fixed 1024, so only 0 to 1023 file descriptors can be listened.

  • The select system call is not thread-safe.

The performance overhead of the above select deficiencies increases linearly with the amount of concurrency.

Obviously select does not solve the C10K problem either, and is only suitable for 1000 or so concurrent connection scenarios.

poll

Poll is like an improved version of SELECT, but it basically works the same way as SELECT.

int poll(struct pollfd *fds, unsigned int nfds, int timeout)

struct pollfd {
    int fd;            /* file descriptor */
    short events;      /* events to listen for */
    short revents;     /* events that actually occurred, set by the kernel */
};

The file descriptor set used by select is the fd_set, a BitMap with a fixed length of 1024. poll instead uses an array of struct pollfd with no fixed length, so there is no limit on the maximum number of descriptors (other than the system's file descriptor limits).
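For comparison, a sketch of the equivalent loop with poll (hypothetical handle_readable helper): the pollfd array replaces the fixed-size BitMap, and readiness is reported through revents instead of overwriting the input events.

#include <poll.h>

// Hypothetical handler for a readable connection.
void handle_readable(int fd);

static void poll_loop(struct pollfd *pfds, nfds_t nfds)
{
    for (nfds_t i = 0; i < nfds; i++)
        pfds[i].events = POLLIN;           // interested in readable events

    for (;;) {
        int ready = poll(pfds, nfds, -1);  // whole array still copied into the kernel
        if (ready <= 0)
            continue;

        // Like select, we still scan every descriptor to find the ready ones.
        for (nfds_t i = 0; i < nfds; i++)
            if (pfds[i].revents & POLLIN)
                handle_readable(pfds[i].fd);
    }
}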

poll improves on select's limit of 1024 file descriptors, but not in terms of performance; in essence it is not much different from select.

  • The file descriptor collection also needs to be polled in kernel space and user space, and the time complexity of finding IO ready sockets is still O(n).

  • It is also necessary to copy the entire set of large file descriptors back and forth between user space and kernel space, whether the file descriptors are ready or not. Their overhead increases linearly with the number of file descriptors.

  • With select and poll, each time a Socket is added or removed, the whole Socket set has to be passed to the kernel again in full.

Poll is also not suitable for high concurrency scenarios. Still can’t solve the C10K problem.

epoll

From the introduction of the core principle of SELECT and poll, we can see that the performance bottleneck of SELECT and poll is mainly reflected in the following three areas:

  • Since the kernel does not store the set of sockets we want to listen for, the full set of socket file descriptors needs to be passed in and out each time a select,poll call is made. This results in a large number of file descriptors being copied back and forth between user space and kernel space frequently.

  • Since the kernel does not notify specific IO-ready sockets, but only flags those IO-ready sockets, when the SELECT system call returns, the set of socket file descriptors still needs to be fully traversed in user space to get the specific IO-ready socket.

  • In kernel space, the IO-ready Sockets are also found by traversing the full set.

Here’s how epoll solves these problems. Before introducing the core principles of epoll, we need to introduce some core basic knowledge needed to understand the working process of epoll.

The creation of a Socket

The server thread starts to block after the accept system call. When a client connects and completes the TCP three-way handshake, the kernel creates a corresponding Socket as the kernel interface for the server to communicate with the client.

From the Linux kernel’s point of view, everything is a file, and sockets are no exception. When the kernel creates a Socket, it manages the Socket in a list of open files for the current process.

What are the kernel data structures associated with processes managing these open file lists? After understanding these data structures, we will have a clearer understanding of the role Socket plays in the kernel. And it will be very helpful for us to understand the creation process of epoll.

Manage file list structure in process

struct task_struct is the data structure in the kernel that represents a process and contains all the information about that process. In this section we list only the properties related to file management.

All files opened by a process are organized and managed through an array, fd_array. The index of this array is the file descriptor we often mention, and the array stores the corresponding file data structure struct file. Whenever a file is opened, the kernel creates a struct file for it and finds a free slot in fd_array to assign to it; that slot's index is the file descriptor we use in user space.

By default, for any process, file descriptor 0 represents stdin (standard input), file descriptor 1 represents stdout (standard output), and file descriptor 2 represents stderr (standard error output).

fd_array is defined in the kernel data structure struct files_struct. Inside, the struct fdtable structure holds a pointer to fd_array: struct file **fd.
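The chain described above can be summarized with a heavily simplified sketch of these kernel structures (not the verbatim kernel definitions; RCU annotations, locks, and most fields are omitted):

// Heavily simplified sketch of the kernel structures described above.
struct fdtable {
    unsigned int max_fds;
    struct file **fd;             /* fd_array: indexed by the file descriptor */
};

struct files_struct {
    struct fdtable *fdt;          /* points at the table holding fd_array */
};

struct task_struct {
    struct files_struct *files;   /* per-process open file table */
};

// Conceptually: current->files->fdt->fd[fd] is the struct file behind an fd.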

Since this section discusses the data structure of the kernel network system, the Socket file type is used as an example:

The private_data pointer in the kernel data structure struct file used to encapsulate file meta-information points to a specific Socket structure.

The file_operations attribute of struct file defines the file operation functions. Different file types have different file_operations; for the Socket file type, file_operations points to socket_file_ops.

The system calls we initiate on a Socket from user space are dispatched first through socket_file_ops.

For example, when a write operation is performed on a Socket, the kernel first calls sock_write_iter defined in socket_file_ops; when a read operation is initiated on the Socket, the kernel correspondingly calls sock_read_iter.


static const struct file_operations socket_file_ops = {
  .owner =  THIS_MODULE,
  .llseek =  no_llseek,
  .read_iter =  sock_read_iter,
  .write_iter =  sock_write_iter,
  .poll =    sock_poll,
  .unlocked_ioctl = sock_ioctl,
  .mmap =    sock_mmap,
  .release =  sock_close,
  .fasync =  sock_fasync,
  .sendpage =  sock_sendpage,
  .splice_write = generic_splice_sendpage,
  .splice_read =  sock_splice_read,
};

Socket kernel structure

When we write a network program, we first create a Socket, then bind and listen on it. We call this Socket the listening Socket.

  1. When we call accept, the kernel creates a new Socket based on the listening Socket; this new Socket is used for network communication with the client. The kernel also assigns the listening Socket's set of Socket operation functions (inet_stream_ops) ops to the new Socket's ops attribute.
const struct proto_ops inet_stream_ops = {
  .bind = inet_bind,
  .connect = inet_stream_connect,
  .accept = inet_accept,
  .poll = tcp_poll,
  .listen = inet_listen,
  .sendmsg = inet_sendmsg,
  .recvmsg = inet_recvmsg,
  ......
}

Note that the listening Socket and the Socket actually used for network communication are two different Sockets: one is called the listening Socket and the other is called the connected Socket.

  2. Next, the kernel creates a struct file for the connected Socket, initializes it, and assigns the Socket file operation function set (socket_file_ops) to the f_ops pointer in the struct file. The file pointer in struct socket then points to this newly allocated struct file.

The kernel maintains two queues:

  • One holds connections that have completed the TCP three-way handshake and are in the established state. This is the full connection queue, called icsk_accept_queue in the kernel.
  • The other holds connections that have not yet completed the TCP three-way handshake and are in the syn_rcvd state. This is the half connection queue.
  3. Then socket->ops->accept is called. From the Socket kernel structure diagram we can see that what actually gets called is inet_accept. This function checks icsk_accept_queue for an already established connection; if one exists, it takes the created struct sock directly from icsk_accept_queue and assigns this struct sock object to the sock pointer in struct socket.

struct sock is the core kernel object inside struct socket. It defines the receive queue, the send queue, the wait queue, the data-ready callback function pointer, and the kernel protocol stack operation function sets mentioned in the receive and send processes above.

  • Based on the protocol parameter passed to sock_create when the Socket is created (for the TCP protocol the value is SOCK_STREAM), the kernel finds the operation method sets defined for TCP, inet_stream_ops and tcp_prot, and assigns them to socket->ops and sock->sk_prot respectively.

Look back at the Socket Kernel Diagram at the beginning of this section to see how they relate to each other.

The Socket-related operation interfaces are defined in the inet_stream_ops function set, which provides the interface toward the user. The operation interface between the Socket and the kernel protocol stack is defined by the sk_prot pointer in struct sock, which points to the tcp_prot protocol operation function set.

struct proto tcp_prot = {
  .name      = "TCP",
  .owner      = THIS_MODULE,
  .close      = tcp_close,
  .connect    = tcp_v4_connect,
  .disconnect    = tcp_disconnect,
  .accept      = inet_csk_accept,
  .keepalive    = tcp_set_keepalive,
  .recvmsg    = tcp_recvmsg,
  .sendmsg    = tcp_sendmsg,
  .backlog_rcv    = tcp_v4_do_rcv,
   ......
}

A system IO call on the Socket first goes through the file_operations in the Socket's struct file, then calls the inet_stream_ops Socket operation functions pointed to by ops in struct socket, and finally calls the tcp_prot kernel protocol stack operation function set pointed to by the sk_prot pointer in struct sock.

  • The sk_data_ready function pointer in the struct sock object is set to sock_def_readable; the kernel calls it back when the Socket data is ready.

  • The wait queue in struct sock stores the processes blocked in system IO calls together with their corresponding callback functions. Remember this place; we will mention it again when introducing epoll!

  4. After these core kernel objects, struct file, struct socket, and struct sock, have been created, the final step is to put the struct file corresponding to the socket into the process's open file list fd_array. The accept system call then returns the socket's file descriptor fd to the user program.

User process blocking and wake-up in blocking IO

As mentioned in the previous section, when a user process makes a system I/O call, for example, read, the user process checks the Socket buffer for incoming data in kernel mode.

  • If there is data in the Socket receive buffer, the data is copied to user space and the system call returns.
  • If there is no data in the Socket receive buffer, the user process relinquishes the CPU and enters the blocked state. When data reaches the receive buffer, the user process is woken up, goes from the blocked state to the ready state, and waits for CPU scheduling.

In this section we will look at how user processes block on the Socket and how they are woken up on the Socket. Understanding this process is important to understand epoll’s event notification process

  • First, when the user process initiates a read system call on the Socket, the user process switches from user mode to kernel mode.
  • In the process's struct task_struct, the kernel finds fd_array and, according to the Socket's file descriptor fd, finds the corresponding struct file, then calls the file operation function defined in that struct file's file_operations; for the read system call this is sock_read_iter.
  • In the sock_read_iter function, the struct socket pointed to by struct file is found and socket->ops->recvmsg is called. From here we know that what gets called is inet_recvmsg, defined in the inet_stream_ops set.
  • In inet_recvmsg, the struct sock is found and sock->sk_prot->recvmsg is called; here what gets called is the tcp_recvmsg function defined in the tcp_prot set.

For the whole call process, please refer to the system IO Call Structure Diagram above.

Now that we are familiar with the kernel function call stack, let's look at how the kernel function tcp_recvmsg blocks the user process during the system IO call.

int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
  size_t len, int nonblock, int flags, int *addr_len)
{
    ......  // omit non-core code

    // Walk the receive queue defined in the sock object
    skb_queue_walk(&sk->sk_receive_queue, skb) {

        ......  // omit non-core code
    }

    // Call sk_wait_data to block the current process if not enough data has arrived
    sk_wait_data(sk, &timeo);
}
int sk_wait_data(struct sock *sk, long *timeo)
{
 // create an element wait_queue_t on the wait queue in struct sock
 // associate the process descriptor and the callback function autoremove_wake_function to wait_queue_t
 DEFINE_WAIT(wait);

 // Call sk_sleep to get the wait_queue_head_t header pointer for the sock object
 // Call prepare_to_WAIT to insert the newly created wait item wait_queue_t into the wait queue and set the process state to INTERRUPTIBLE
 prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);

 // Release the CPU by calling schedule_timeout and then sleep, resulting in a context switchrc = sk_wait_event(sk, timeo, ! skb_queue_empty(&sk->sk_receive_queue)); .Copy the code
  • First, DEFINE_WAIT creates a wait item of type wait_queue_t for the wait queue in struct sock.
#define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)

#define DEFINE_WAIT_FUNC(name, function)    \
 wait_queue_t name = {      \
  .private = current,    \
  .func  = function,    \
  .task_list = LIST_HEAD_INIT((name).task_list), \
 }

The private field of the wait item wait_queue_t is used to associate the user process blocked on the current Socket, and func is used to associate the callback function registered on the wait item; autoremove_wake_function is registered here.

  • Call sk_sleep(SK) to get the wait_queue_head_t pointer in the struct sock object.

  • prepare_to_wait is called to insert the newly created wait item wait_queue_t into the wait queue and to set the process state to TASK_INTERRUPTIBLE.

  • sk_wait_event is called to release the CPU, and the process goes to sleep.

The key is to remember the structure of the wait type wait_queue_t on the wait queue defined in struct sock. We’ll use it again later in the introduction to Epoll.

Let’s move on to how the user process is woken up when the data is ready

At the beginning of this article, we mentioned in the section “Network Packet Receiving Process” :

  • When a network packet arrives at the NIC, the NIC places the data into the RingBuffer via DMA.
  • The NIC then raises a hard interrupt to the CPU, and in the hard interrupt handler an sk_buffer is created and the network data is copied into it.
  • The kernel thread ksoftirqd then responds to the soft interrupt and calls the poll function, which sends the sk_buffer to the kernel protocol stack for layer-by-layer protocol processing.
  • At the transport layer, the tcp_rcv function removes the TCP header and finds the corresponding Socket according to the four-tuple (source IP, source port, destination IP, destination port).
  • Finally, the sk_buffer is placed in the Socket's receive queue.

This is the complete process of the kernel receiving network data. Let’s look at how the user process is woken up after the packet is received.

  • When the soft interrupt places the sk_buffer into the Socket's receive queue, it then calls the data-ready callback pointer sk_data_ready which, as mentioned earlier, points to the sock_def_readable function after initialization.

  • In the sock_def_readable function, the socket->sock->sk_wq wait queue is fetched. wake_up_common finds a wait item wait_queue_t in the wait queue sk_wq and calls the func callback registered on it (wait_queue_t->func). The callback function registered there is autoremove_wake_function.

Even if multiple processes are blocked on the same Socket, only one process is woken up, in order to avoid the thundering herd problem.

  • In the autoremove_wake_function function, try_to_wake_up is called for the blocked process associated through the private field of the wait item wait_queue_t, waking up the process blocked on the Socket.

Remember the func function pointer in wait_queue_t: this is where epoll will register its callback function.
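To make this wake-up path concrete, here is a rough, simplified sketch of what the kernel's wake-up loop (__wake_up_common in the kernel versions this article's structures correspond to) looks like; it is not the verbatim source. The point is that it invokes each wait item's func callback and stops after waking nr_exclusive exclusive waiters, which is why only one blocked process is woken:

#include <linux/wait.h>

/* Simplified sketch, not the verbatim kernel source. */
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                             int nr_exclusive, int wake_flags, void *key)
{
        wait_queue_t *curr, *next;

        list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
                unsigned flags = curr->flags;

                /* curr->func is autoremove_wake_function for a plain blocking
                 * read, or ep_poll_callback when epoll manages the Socket. */
                if (curr->func(curr, mode, wake_flags, key) &&
                    (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                        break;   /* wake at most nr_exclusive exclusive waiters */
        }
}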

Now that we’ve covered the basics of understanding epoll, with so much chatter, it’s time to finally get to the topic of this section, epoll.

Epoll_create Creates an epoll object

epoll_create is the system call the kernel provides us to create an epoll object. When we call epoll_create in the user process, the kernel creates a struct eventpoll object for us, which also has a corresponding struct file associated with it. Likewise, the struct file associated with the struct eventpoll object needs to be managed in the process's open file list fd_array.

Familiar with Socket creation logic, epoll creation logic is not difficult to understand.

The file_operations pointer in the struct file associated with the struct eventpoll object points to the eventpoll_fops operation function set.

static const struct file_operations eventpoll_fops = {
     .release = ep_eventpoll_release,
     .poll = ep_eventpoll_poll,
}

struct eventpoll {

    // Wait queue, where processes blocking on epoll are placed
    wait_queue_head_t wq;

    // Ready queue, where iO-ready socket connections are placed
    struct list_head rdllist;

    // The red-black tree is used to manage all monitored socket connections
    struct rb_root rbr;

    ......
}
  • wait_queue_head_t wq: the wait queue in epoll, which holds the user processes blocked on epoll. When IO becomes ready, epoll finds these blocked processes through this queue and wakes them up to execute IO calls that read and write the data on the Sockets.

Note here and Socket wait queue distinction!!

  • struct list_head rdllist: the ready queue in epoll, which holds the IO-ready Sockets. The awakened user process can read the IO-active Sockets directly from this queue, without having to traverse the whole Socket set again.

This is where epoll is more efficient than select poll, which returns all socket connections. We need to iterate in user space to find the socket connections that are actually IO active.

Epoll only returns Socket connections that are IO active. User processes can directly perform I/O operations.

  • struct rb_root rbr: because a red-black tree performs well overall for lookup, insertion, and deletion, epoll internally uses a red-black tree to manage the huge number of Socket connections.

Select manages connections with arrays, poll manages connections with linked lists.

Epoll_ctl adds the listening Socket to the epoll object

After we call epoll_create to create the epoll object struct EventPoll in the kernel, we can use epoll_ctl to add the Socket connections we need to manage to epoll.

  1. The first step is to create, in the epoll kernel, the data structure that represents a Socket connection: struct epitem. Since epoll uses a red-black tree to manage these massive numbers of Socket connections for overall performance, struct epitem is a red-black tree node.

struct epitem
{
      // Points to the owning epoll object
      struct eventpoll *ep; 
      // Register the event of interest, that is, user-space epoll_event
      struct epoll_event event; 
      // Points to the ready queue in the epoll object
      struct list_head rdllink;  
      // points to the corresponding red-black tree node in epoll
      struct rb_node rbn;     
      // refer to the socket->file structure represented by epitem and the corresponding fd
      struct epoll_filefd ffd;                  
  }

Remember the rdllink and epoll_filefd members of the struct epitem structure; we will use them later.

  2. After the struct epitem data structure representing the Socket connection has been created in the kernel, we need to create a wait item wait_queue_t on the Socket's wait queue and register epoll's callback function ep_poll_callback on it.

From the introduction to user process blocking and Awakening in Blocking IO, I think you can already guess the significance of this step. The autoremove_wake_function callback was registered in the wait_queue_t. Remember?

epoll's callback function ep_poll_callback is the core of epoll's synchronous IO event notification mechanism, and it is where the fundamental performance difference from poll and select lies.

Here a new data structure, struct eppoll_entry, appears. What is it for? You can first guess its role from the figure above!

We know that the items in the socket->sock->sk_wq wait queue are of type wait_queue_t, and we need to register the epoll callback function ep_poll_callback on the wait queue of the Socket represented by struct epitem.

When data reaches the receive queue in the Socket, the kernel calls back sk_data_ready. From the section on user process blocking and wake-up in blocking IO, we know that the sk_data_ready function pointer points to the sock_def_readable function. In sock_def_readable, the callback registered on the wait item, wait_queue_t->func, is invoked, which here is ep_poll_callback. In ep_poll_callback we need to find the epitem and put the IO-ready epitem into the ready queue in epoll.

A wait item of type wait_queue_t on the socket's wait queue cannot by itself be associated with an epitem. Hence the struct eppoll_entry structure, which ties the wait_queue_t in the Socket's wait queue to the epitem.

struct eppoll_entry {
   // Points to the associated epitem
   struct epitem *base;

   // Wait item: private = NULL, func = ep_poll_callback
   wait_queue_t wait;

   // Pointer to the wait queue head in the listened socket
   wait_queue_head_t *whead;
   ...
};

In the ep_poll_callback callback function, eppoll_entry can be found using the container_of macro based on the wait item in the Socket wait queue, and then epitem can be found.

container_of is a macro commonly used in the Linux kernel to obtain a pointer to a structure from a pointer to one of that structure's members.
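To make the idea concrete, here is a small self-contained user-space demo of the container_of trick, using toy struct names in place of the kernel's eppoll_entry and wait_queue_t:

#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct wait_item { int dummy; };            // stands in for wait_queue_t

struct entry {                              // stands in for eppoll_entry
    int base_id;                            // stands in for the epitem pointer
    struct wait_item wait;                  // embedded wait item
};

int main(void)
{
    struct entry e = { .base_id = 42 };
    struct wait_item *w = &e.wait;          // all the callback receives is this pointer

    // Recover the enclosing entry from the pointer to its member,
    // which is exactly what ep_poll_callback does to find the epitem.
    struct entry *recovered = container_of(w, struct entry, wait);
    printf("base_id = %d\n", recovered->base_id);   // prints 42
    return 0;
}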

Note that the private field of the wait_queue_t is set to NULL here. Because the Socket is handed over to epoll for management and processes blocked on IO are woken up through epoll, the func registered on the wait_queue_t is ep_poll_callback rather than autoremove_wake_function; there is no blocked process on the socket that needs autoremove_wake_function to wake it, so private is simply set to NULL.

  3. The wait item wait_queue_t has now been created on the Socket's wait queue, epoll's callback function ep_poll_callback has been registered on it, and it has been associated with the epitem through eppoll_entry.

All that is left to do is insert the epitem into epoll's red-black tree struct rb_root rbr.

Here you can see another optimization of epoll: epoll centrally manages all socket connections through a red-black tree in the kernel, and every addition or removal of a socket connection is an incremental operation, instead of passing the full set of socket connections into the kernel on every call as select and poll do. This avoids frequent, large memory copies.

epoll_wait synchronously blocks to obtain IO-ready Sockets

  1. After a user program calls epoll_wait, the kernel first checks epoll's ready queue eventpoll->rdllist for an IO-ready epitem (the epitem encapsulates the socket's information). If there is a ready epitem in the ready queue, the ready socket's information is wrapped into epoll_event and returned.

  2. If there is no IO-ready epitem in the eventpoll->rdllist ready queue, a wait item wait_queue_t is created, the user process is associated with it through wait_queue_t->private, and the callback default_wake_function is registered on wait_queue_t->func. Finally, the wait item is added to the wait queue in epoll, and the user process relinquishes the CPU and enters the blocked state.

The blocking principle is the same as in the blocking IO model, except that in the blocking IO model autoremove_wake_function is registered on wait_queue_t->func and the wait item is added to the wait queue in the socket, whereas here default_wake_function is registered and the wait item is added to the wait queue in epoll.

  3. With all this groundwork laid, here finally is epoll's entire workflow (a minimal user-space sketch follows the list):

  • When a network packet, after being processed by the kernel protocol stack in the soft interrupt, reaches the socket's receive buffer, the socket's data-ready callback pointer sk_data_ready (pointing to the sock_def_readable function) is invoked immediately. It finds the wait item in the socket's wait queue, where the registered callback function is ep_poll_callback.

  • In the callback function ep_poll_callback, the container_of macro is used to locate the eppoll_entry object from the wait item, and its base pointer is followed to find the struct epitem that encapsulates the socket, which is then added to epoll's ready queue rdllist.

  • It then checks whether there are any wait items on the wait queue in epoll, that is, whether any process is blocked in epoll_wait waiting for IO-ready sockets. If there are no wait items, soft interrupt processing is complete.

  • If there is a wait item, the callback default_wake_function registered on it is invoked to wake up the blocked process, and the socket information of the epitems on the ready queue rdllist is wrapped into struct epoll_event and returned.

  • The user process obtains the IO-ready socket from the returned epoll_event and initiates the system IO call to read the data.
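Seen from user space, the whole workflow boils down to a loop like the following minimal sketch (epfd is the epoll fd returned by the earlier add_to_epoll sketch; event_loop is an illustrative name and buffer handling is simplified):

#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 64

void event_loop(int epfd)
{
    struct epoll_event events[MAX_EVENTS];
    char buf[4096];

    for (;;) {
        // Blocks on epoll's wait queue until the ready queue rdllist is non-empty
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);

        for (int i = 0; i < n; i++) {
            if (events[i].events & EPOLLIN) {
                // The socket is IO ready; the user process still performs the
                // read itself, which is why epoll is classified as synchronous IO.
                ssize_t len = read(events[i].data.fd, buf, sizeof(buf));
                (void)len;  // process the data ...
            }
        }
    }
}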

Talk about level triggering and edge triggering

There are many explanations of these two modes online, most of them vague, seemingly forced definitions that are hard to understand. So, based on the epoll workflow above, I will give my own interpretation of the two modes and try to explain clearly how they are alike and how they differ.

From the detailed walkthrough of epoll's workflow, we know that when data arrives on a socket we are listening on, the soft interrupt executes epoll's callback function ep_poll_callback, which inserts the epitem describing that socket into epoll's ready queue rdllist. The user process is then woken up from epoll's wait queue, epoll_wait returns the IO-ready socket to the user process, and epoll_wait clears the rdllist.

The key difference between level triggering and edge triggering lies in what epoll_wait does with the rdllist when there is still unread data in the socket's receive buffer.

  • Level-triggered: in this mode, after the user thread calls epoll_wait to obtain the IO-ready sockets and then issues system IO calls to read their data, the next epoll_wait checks whether any data is left in the receive buffers of those sockets; if so, it puts the Socket back into the rdllist. Therefore, as long as the IO on a socket has not been fully processed, the user process can keep handling that socket's IO events by calling epoll_wait again.

  • Edge-triggered: in this mode, epoll_wait clears the rdllist unconditionally, regardless of whether there is data left to read on the socket. So if you call epoll_wait again before you have processed the remaining readable data in a socket's receive buffer, that socket will not be returned again, the user process will not get it, and it can no longer perform IO on it, unless new IO data arrives on the socket, at which point it is added to the rdllist again following the epoll workflow.

In edge-triggered mode, if you only process part of the data on a socket, you will not be notified about that socket again until new network data arrives on it, so the usual approach is to drain the receive buffer in one go (see the sketch below).
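A minimal sketch of that pattern, assuming the fd has already been set non-blocking and with error handling trimmed (register_edge_triggered and drain_socket are illustrative names):

#include <sys/epoll.h>
#include <errno.h>
#include <unistd.h>

void register_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLET;          // edge-triggered read events
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

void drain_socket(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len > 0)
            continue;                       // keep reading until the buffer is empty
        if (len < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;                          // receive buffer drained; wait for the next event
        break;                              // 0 = peer closed, or another error
    }
}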

The EpollSocketChannel implemented in Netty uses edge-triggered mode by default, while the JDK's NIO defaults to level-triggered mode.

Summary of epoll's optimizations over select and poll

  • epoll manages the huge number of connections through a red-black tree inside the kernel, so calling epoll_wait to obtain the IO-ready sockets does not require passing in the monitored file descriptors. This avoids copying the massive file descriptor collection back and forth between user space and kernel space.

Select and poll require passing the full set of file descriptors on every call, which results in frequent, large copy operations.

  • epoll only reports the IO-ready sockets, which avoids the overhead of traversing the whole socket collection in user space.

Select and poll return the full set of socket connections, and user space has to traverse them all to find the ones that are actually IO active.

  • epoll registers the callback function ep_poll_callback on the socket's wait queue to notify the user program of IO-ready sockets, which avoids the overhead of polling in the kernel.

In most cases sockets are not constantly IO active. Faced with a massive number of connections, select and poll rely on the kernel polling through all of them to find the IO-active sockets, which is the core reason for their poor performance.

Thanks to the performance advantages above, epoll is by far the dominant network IO model used by major network frameworks and reverse-proxy middleware.

The C10K problem can be solved easily with the epoll multiplexing IO model.

The C100K solution also builds on the C10K approach of epoll plus thread pools, together with improvements in CPU, memory, and network interface performance and capacity. For the most part, C100K comes naturally.

Even the C1000K solution is still essentially built on epoll's multiplexed IO model. However, beyond the IO model it requires deep optimization at every level, from the application through the Linux kernel down to the CPU, memory, and network, and in particular offloading to hardware much of the work previously handled in software (eliminating a lot of interrupt-handling overhead as well as kernel protocol stack processing overhead).

Signal driven IO

Everyone is familiar with this scene: when we eat at certain fast-food restaurants, after ordering and paying, the owner hands us a pager. With the pager we can go find a table or do whatever we like, and when it lights up it means the meal is ready and we can go to the counter to pick it up.

This typical life scenario is very similar to the signal-driven IO model we will introduce.

In the signal-driven IO model, the user process registers a signal handler for the corresponding socket through the sigaction system call. The process is not blocked at this point and continues with its work. When the kernel has the data ready, it generates a SIGIO signal for the process, and through the signal callback notifies the process to perform the relevant IO operations.

Note: the signal-driven IO model is still synchronous IO. It waits for data without blocking or frequent polling, but once the data is ready and the kernel has delivered the signal notification, the user process still has to read the data itself, blocking during the data copy phase.

Compared with the previous three IO models, the signal-driven IO model keeps the process unblocked while waiting for data to become ready, so the main loop can keep working and the theoretical performance is better.

In practice, the signal-driven IO model is rarely adopted when TCP is used for communication. Here’s why:

  • Under a large number of IO operations, signal notifications may be lost because the signal queue overflows.
  • SIGIO is a Unix signal that carries no additional information; if a signal source can raise the signal for multiple reasons, the receiver cannot tell what happened. A TCP socket can produce seven kinds of signal events, so when the application receives SIGIO it has no way to tell them apart and handle each one accordingly.

The signal-driven IO model can, however, be used for UDP communication, because UDP has only a single data-arrival event. Normally a UDP process calls the read system call to read the incoming data as soon as SIGIO is caught, and if something abnormal happens, an error is returned.
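As a minimal sketch of this UDP usage (udp_fd, on_sigio, and enable_sigio are illustrative names; error handling is omitted): set the socket owner, enable O_ASYNC so the kernel delivers SIGIO when data arrives, and read inside the signal handler.

#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>

static int udp_fd;

static void on_sigio(int signo)
{
    char buf[2048];
    // The kernel only tells us data is ready; we still copy it out ourselves,
    // which is why this model is still synchronous IO.
    recv(udp_fd, buf, sizeof(buf), 0);
}

void enable_sigio(int fd)
{
    udp_fd = fd;

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigio;
    sigaction(SIGIO, &sa, NULL);            // register the SIGIO handler

    fcntl(fd, F_SETOWN, getpid());          // deliver SIGIO to this process
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
}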


As an aside, the everyday analogy for the blocking IO model is queuing at a cafeteria: you have to wait in line to order and then keep waiting for the chef to serve your food.

IO multiplexing is like waiting for a table at a restaurant: the host calling out numbers plays the role of select, poll, and epoll, uniformly managing the "meal ready" events of all customers; the customers are like socket connections, and whoever's turn it is to eat, the host notifies them.

Asynchronous IO (AIO)

The four IO models described above are all synchronous IO, and all of them block in the second phase, the data copy phase.

The asynchronous IO model is easy to understand after the earlier section on synchronization and asynchrony. In the asynchronous IO model, both the data preparation phase and the data copy phase of an IO operation are performed by the kernel without causing any blocking to the application; the application only needs to use the data placed in the buffer it specified.

The main difference between asynchronous IO and signal-driven IO is that with signal-driven IO the kernel tells us when an IO operation can be started, whereas with asynchronous IO the kernel tells us when an IO operation has completed.

An everyday example: the asynchronous IO model is like dining in a private room at a fancy restaurant. We just sit in the room and order (analogous to the asynchronous IO call); after that we don't have to manage anything, drinking and chatting as we please, and once the meal is ready the waiter (analogous to the kernel) delivers it to our room (analogous to user space). There is no blocking anywhere.
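To illustrate the model itself rather than any particular framework, here is a sketch using the POSIX AIO interface (async_read is an illustrative name). Note that on Linux this POSIX interface is largely emulated in user space by glibc threads, which is part of why Linux asynchronous IO is considered immature.

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

int async_read(int fd, char *buf, size_t len)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;         // user-space buffer the kernel/library fills for us
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0)       // submit the request and return immediately
        return -1;

    // The caller is free to do other work here; this sketch simply polls for completion.
    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);

    return (int)aio_return(&cb); // number of bytes already placed in 'buf'
}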

Asynchronous IO system calls require support from the operating system kernel. At present only IOCP on Windows implements a truly mature asynchronous IO mechanism.

Linux's asynchronous IO implementation is not mature enough, and its performance improvement over NIO is not obvious.

However, io_uring, the new asynchronous IO library introduced into kernel version 5.1 by Jens Axboe of Facebook, addresses some of the performance issues of the original Linux native AIO, and its performance improvements over epoll and the earlier native AIO are noteworthy.

In addition, since the signal-driven IO model is not applicable to the TCP protocol, the IO multiplexing model remains the most widely used today.

IO thread model

In the preceding sections we walked through the process of receiving and sending network packets in detail, and through the five IO models we saw how the kernel reads network data and notifies user threads.

The previous content analyzes the sending and receiving model of network data from the perspective of kernel space. In this section, we look at how to send and receive network data from the perspective of user space.

The user-space IO thread model is relatively simple compared to the kernel. These user-space IO thread models discuss who is responsible for receiving connections, who is responsible for responding to IO reads and writes, who is responsible for computing, and who is responsible for sending and receiving when multiple threads work together.

Reactor

The Reactor model uses NIO and divides the work among IO threads in different ways:

  • It uses the IO multiplexing models we introduced earlier, such as select, poll, epoll, and kqueue, to register and listen for IO events.
  • It dispatches the ready IO events it detects to the specific Handlers that carry out the corresponding IO event processing.

IO multiplexing lets us continuously listen for IO events and dispatch them, much like a reactor that seems to be constantly producing IO events, which is why this is called the Reactor model.

Let’s look at the three classifications of the Reactor model:

Single Reactor Single thread

The Reactor model relies on IO multiplexing to monitor IO events and continuously produce IO-ready events. On Linux the IO multiplexing mechanism we use is epoll, so we take Linux as the example:

  • Single Reactor means there is only one epoll object, which listens for all events, such as connection events and read/write events.
  • Single thread means only one thread executes epoll_wait to obtain the IO-ready Sockets, and that same thread also performs the reads, writes, and subsequent business processing on those ready Sockets.

The single Reactor single thread model is like running a tiny restaurant: as the owner, one person has to do everything, including greeting customers (accepting connections), handing out menus and waiting for customers to order (IO requests), cooking (business processing), serving the dishes (IO responses), and seeing customers off (disconnecting).

Single Reactor multithreading

As the number of customers increased (concurrent requests), it became clear that the restaurant was too busy for us to handle alone (single-threaded), so we needed to hire more people (multi-threaded) to help with the above tasks.

Hence the single-reactor multithreaded model:

  • In this model there is still only one epoll object listening for all IO events, and one thread calls epoll_wait to obtain the IO-ready Sockets.
  • But when IO-ready events occur, the Handlers that process the business logic for these IO events are executed in a thread pool. Compared with the single Reactor single thread model, this improves execution efficiency and takes full advantage of multi-core CPUs.

Master-slave Reactor multithreading

In everything we do, we should set priorities: handle the higher-priority things first and handle them well, rather than trying to do everything at once.

As our little restaurant gets more and more guests (rising concurrency) and we expand the place, we find that greeting guests is the most important job: we must get guests in the door and never let them walk away because nobody received them. As long as the guests are seated, it matters much less if the food comes a little slowly.

Thus the master-slave Reactor multithreading model was born:

  • We go from a single Reactor to multiple Reactors. The main Reactor handles the highest-priority work, namely receiving guests (processing connection events); the corresponding handler is the acceptor shown in the diagram.

  • After the connection is established and the corresponding socket is created, the acceptor registers the socket's read events with a slave Reactor, which then listens for read and write events on that socket.

  • Ultimately, the business logic processing of reading and writing is handed over to the thread pool.

Note: only read events are registered with the slave Reactor, not write events, because read events are triggered by the kernel through epoll, while write events are triggered by the user's business thread (which decides when to send data). So write events should be registered by the user's business thread when needed.

The user thread registers a write event only when the data to be sent cannot all be written into the send buffer at once; when the buffer becomes writable again, the user thread continues writing the remaining data. If all the data can be written into the buffer in one go, there is no need to register a write event with the Reactor at all. A sketch of this pattern follows.
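Expressed in terms of raw epoll rather than Netty's API (send_data is an illustrative name, the fd is assumed to be already registered with epoll and non-blocking, and the bookkeeping of the unsent remainder is omitted), the rule looks roughly like this:

#include <sys/epoll.h>
#include <errno.h>
#include <unistd.h>

void send_data(int epfd, int fd, const char *data, size_t len)
{
    ssize_t sent = write(fd, data, len);

    if (sent == (ssize_t)len) {
        // Everything fit into the socket send buffer: no write event needed.
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
    } else if (sent >= 0 || errno == EAGAIN) {
        // Buffer full: remember the unsent remainder (omitted here) and ask
        // epoll to notify us when the socket becomes writable again.
        struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
    }
}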

The master-slave Reactor multithreading model is the IO thread model used by most mainstream network frameworks, and Netty, the subject of this series, uses it as well.

Proactor

Proactor is an IO thread model based on AIO. We introduced the asynchronous IO model earlier: it is a fully asynchronous programming model supported by the operating system kernel, with no blocking anywhere in the data preparation and data copy phases.

The Proactor IO thread model hands the listening for IO events, the execution of IO operations, and the dispatch of IO results over to the kernel.

Components of the Proactor model:

  • Completion Handler: a callback function defined by the user program for an asynchronous IO operation. When the asynchronous IO operation completes, the Completion Handler is called back by the kernel and informed of the IO result.

  • Completion Event Queue: after an asynchronous IO operation completes, the corresponding IO completion event is generated and added to this queue.

  • Asynchronous Operation Processor: responsible for executing asynchronous IO operations; after execution completes, it generates the IO completion event and places it in the Completion Event Queue.

  • Proactor: an event loop dispatcher responsible for fetching IO completion events from the Completion Event Queue and calling back the Completion Handler associated with each IO completion event.

  • Initiator: initiates the asynchronous operation and registers the Completion Handler and the Proactor with the kernel through the Asynchronous Operation Processor.

Execution flow of the Proactor model:

  • The user thread initiates aio_read and tells the kernel the address of a read buffer in user space, so that when the kernel finishes the IO operation it can place the result into that buffer and the user thread can read it directly (without any blocking).

  • The Initiator initiates the aio_read asynchronous operation and registers the Completion Handler with the kernel.

The I/O completion event we care about in Proactor: the kernel has already read the data for us and put it into the read buffer we specify, which the user thread can read directly.

In Reactor we care about IO ready events: data has arrived, but needs to be read by the user thread.

  • The user thread can then do other things without waiting for the I/O result. The kernel starts asynchronously performing IO operations at the same time. A Completion Event is generated when the I/O operation completes and added to the Completion Event Queue.

  • Proactor retrieves the Completion event from the Completion Event Queue and calls back the Completion handler associated with the IO completion event.

  • Complete the business logic processing in the Completion Handler.

Compare Reactor and Proactor

  • Reactor is an IO thread model implemented on top of NIO; Proactor is an IO thread model implemented on top of AIO.

  • Reactor is concerned with IO ready events, and Proactor is concerned with IO complete events.

  • In Proactor, the user program must pass the address of a user-space read buffer to the kernel, which Reactor does not require. As a result, each concurrent operation in Proactor needs its own buffer, incurring a certain memory overhead.

  • The implementation logic of Proactor is complex, and its coding cost is much higher than that of Reactor.

  • Proactor outperforms Reactor when handling long-running IO operations, but for short IO operations it does not noticeably improve IO execution efficiency.

Netty IO model

Now that we have covered the process of sending and receiving network packets in the kernel and the five IO models and two IO thread models, let’s look at what the IO model in Netty looks like.

When we introduced the Reactor IO threading model, we mentioned that there are three kinds of Reactor models: single Reactor single thread, single Reactor multi-thread, and master/slave Reactor multi-thread.

All three Reactor variants are supported in Netty, but the one we usually use is the master-slave Reactor multithreading model.

The three Reactor variants introduced above are only models and design ideas. In practice, network frameworks do not implement them strictly to the letter; there are small differences, but the overall design idea is the same.

So what does Netty's master-slave Reactor multithreading model look like?

  • In Netty, Reactors appear in the form of groups, divided into two groups: one is the MainReactorGroup, the EventLoopGroup we often see in code as the bossGroup; the other is the SubReactorGroup, the EventLoopGroup we often see as the workerGroup.

  • There is usually only one Reactor in the MainReactorGroup, and it does the most important thing: listening for connection accept events. When a connection event occurs, the corresponding handler, the acceptor, creates and initializes the NioSocketChannel (representing a Socket connection), then selects a Reactor from the SubReactorGroup, registers the channel on it, and listens for Read events.

The reason there is only one Reactor in the MainReactorGroup is that usually our server program will bind to listen on only one port. If we bind to listen on multiple ports, we will configure multiple reactors.

  • The SubReactorGroup contains multiple Reactors; the exact number can be specified with the system property -Dio.netty.eventLoopThreads and defaults to the number of CPU cores x 2. The Reactors in the SubReactorGroup listen for read and write events, each Reactor listening on its own set of socket connections, so the full set of connections is spread across multiple Reactors.

  • Each Reactor is assigned one IO thread, which retrieves IO-ready events from the Reactor, performs the IO calls to fetch the IO data, and executes the PipeLine.

A Socket connection is assigned to a fixed Reactor after it is created, so it is always served by a fixed IO thread. Each Socket connection is also assigned its own PipeLine instance, which orchestrates the IO processing logic for that connection. The purpose of this lock-free serialization is to prevent multiple threads from concurrently executing IO logic on the same socket connection and to avoid thread-safety problems, while at the same time maximizing system throughput.

Since each Reactor has only one IO thread, that thread must not only execute the ChannelHandlers in the PipeLines of the IO-active Socket connections but also fetch IO-ready events from the Reactor and perform the IO calls. The logic executed in a PipeLine's ChannelHandlers should therefore not take too long; time-consuming business logic should be moved into a separate business thread pool, otherwise the IO reads and writes of other connections will be delayed, which in turn hurts the IO throughput of the whole service.

  • When the business logic for an IO request has finished in the business thread, the ChannelHandlerContext reference held by that thread is used to propagate the response data through the PipeLine and finally write it back to the client.

That concludes Netty's IO model. Let's briefly look at how Netty is configured to support the three Reactor variants mentioned above.

Configure a single Reactor and a single thread

EventLoopGroup eventGroup = new NioEventLoopGroup(1);
ServerBootstrap serverBootstrap = new ServerBootstrap(); 
serverBootstrap.group(eventGroup);

Configure a single Reactor with multiple threads

EventLoopGroup eventGroup = new NioEventLoopGroup();
ServerBootstrap serverBootstrap = new ServerBootstrap(); 
serverBootstrap.group(eventGroup);

Configure master-slave Reactor multithreading

EventLoopGroup bossGroup = new NioEventLoopGroup(1); 
EventLoopGroup workerGroup = new NioEventLoopGroup();
ServerBootstrap serverBootstrap = new ServerBootstrap(); 
serverBootstrap.group(bossGroup, workerGroup);

Conclusion

This article packs in a lot of information: with 25 pictures and 22,336 words, it walked from how the kernel handles the sending and receiving of network packets, to the often-confused concepts of blocking vs. non-blocking and synchronous vs. asynchronous seen from the kernel's perspective. On that foundation we introduced the five IO models via the C10K problem, then presented the principles of select, poll, and epoll and a comprehensive comparison of them as a technical evolution of IO multiplexing. Finally, we introduced the two IO thread models and the Reactor model used in Netty.

Thank you for listening to me here, ha ha, now you can rub your eyes, stretch, have a good rest.