If you work on server-side development, you inevitably deal with network programming. epoll is an essential technique for high-performance network servers on Linux; Nginx, Redis, Skynet, and most game servers use this multiplexing technology.

epoll is clearly important, but how does it differ from select? What makes epoll efficient?

Although there are many articles about epoll on the Internet, they are either too shallow or bogged down in source-code analysis, and few are genuinely easy to follow. I decided to write this article so that even readers without a systems background can understand how epoll works.

The core goal of this article is to help readers clearly understand why epoll performs so well.

This article starts with how the network card receives data, then brings in CPU interrupts and operating-system process scheduling; step by step, it traces the evolution from blocking receive, to select, to epoll. Finally, it explores the implementation details of epoll.

1. Starting from how the network card receives data

Below is a diagram of a typical computer structure: a CPU, memory, and a network interface. The first step to understanding the nature of epoll is to look, from the hardware's point of view, at how a computer receives network data.

The following figure shows the process of the network card receiving data.

  • In stage ①, the network card receives data from the network cable.
  • In stage ②, the data is moved through the hardware circuitry.
  • In stage ③, the data is written to an address in memory.

This process involves DMA transfers, I/O path selection, and other hardware details, but all we need to know is this: the NIC writes the received data into memory.

The process by which a network adapter receives data

Through this hardware transfer, the data received by the NIC is stored in memory, where the operating system can read it.

2. How does the CPU know data has arrived?

The second step to understanding the nature of epoll is to look at data reception from the CPU's point of view. To understand this, you need to understand the concept of an interrupt.

Programs running on a computer have priorities. For example, when the computer receives a power-failure signal, it should immediately save its data, so the program that saves the data has high priority (capacitors store a small amount of charge, enough to run the CPU briefly).

In general, signals generated by hardware need an immediate response from the CPU, or data may be lost, so they have high priority. The CPU interrupts the currently executing program to respond; once it has finished handling the hardware, it resumes the user program. The interrupt process is similar to a function call, except that the call site of a function is fixed in advance, while the point of an interrupt is determined by the "signal."

Interrupt routine call

Take the keyboard as an example. When a user presses a key, the keyboard raises the voltage on the CPU's interrupt pin. The CPU detects this signal and executes the keyboard interrupt handler. The following figure shows how various hardware devices interact with the CPU through interrupts.

How the CPU knows data has been received

3. Why does process blocking not occupy CPU resources?

The third step to understanding the nature of epoll is to look at data reception from the perspective of operating-system process scheduling. Blocking is a key part of process scheduling: it refers to the waiting state a process enters before some event (such as receiving network data) occurs. recv, select, and epoll are all methods that can block, and a blocked process does not consume CPU resources.

To keep things simple, let's start with an ordinary blocking recv, using the following code:

```c
int s = socket(AF_INET, SOCK_STREAM, 0);
// bind
bind(s, ...);
// listen
listen(s, ...);
// accept
int c = accept(s, ...);
// receive data
recv(c, ...);
// print the received data
printf(...);
```

This is the most basic network programming code: first create a socket object, call bind, listen, and accept, and finally call recv to receive data. recv is a blocking method; when the program reaches recv, it waits until data arrives before proceeding.

So how does blocking work?

The work queue

To support multitasking, the operating system implements process scheduling, which divides processes into several states such as "running" and "waiting." A running process holds the CPU and is executing code; a waiting process is blocked. For example, when the program reaches recv, the process changes from running to waiting, and changes back to running after data is received. The operating system executes the runnable processes in time slices, switching so fast that it appears to be performing multiple tasks at once.

The computer in the figure below is running three processes, A, B, and C, of which process A executes the basic network program above. Initially, all three processes are referenced by the operating system's work queue, are in the running state, and take turns executing.

Waiting queue

When process A executes the statement that creates the socket, the operating system creates a socket object managed by the file system (as shown below). This socket object contains members such as a send buffer, a receive buffer, and a wait queue. The wait queue is an important structure: it references every process that is waiting for an event on this socket. When the program reaches recv and no data has arrived, the operating system moves process A from the work queue onto this socket's wait queue; A then gets no CPU time, which is exactly why blocking consumes no CPU.

Wake up the process

When the socket receives data, the operating system moves the process on the socket's wait queue back onto the work queue. The process becomes runnable and continues executing. And since the socket's receive buffer now contains data, recv can return the received data.

4. The whole process of the kernel receiving network data

This section combines the network-card, interrupt, and process-scheduling knowledge above to describe the whole process by which the kernel receives data while recv blocks.

As shown in the figure below, while recv blocks, the computer receives data from the peer (step ①), and the data is delivered to memory via the NIC (step ②). The NIC then notifies the CPU of the data's arrival with an interrupt signal, and the CPU executes the interrupt routine (step ③).

The interrupt routine here does two things: first, it writes the network data into the corresponding socket's receive buffer (step ④); then it wakes up process A (step ⑤), putting A back onto the work queue.

The kernel receives data throughout the process

The process of waking up a process is shown below:

Wake up the process

That is the whole process of the kernel receiving data. Two questions are worth pausing on:

  • First, how does the operating system know which socket an incoming packet belongs to?
  • Second, how can the data of multiple sockets be monitored at the same time?

The first question: because a socket is bound to a port number, and network packets carry IP and port information, the kernel can find the corresponding socket by port number. To speed this up, the operating system maintains an index from port numbers to sockets for fast lookup.

The second question is the key to multiplexing and is the focus of the rest of this article.

5. A simple way to monitor multiple sockets at the same time

A server must manage many client connections, but recv can only watch a single socket. Given this mismatch, people began looking for ways to monitor multiple sockets. The essence of epoll is to monitor multiple sockets efficiently.

Historically, there must first have been a less efficient method that people then improved upon; select is that predecessor to epoll.

You can better understand the nature of epoll by first understanding the less efficient select.

Suppose you could pass in a list of sockets: if none of the sockets in the list has data, suspend the process; as soon as any socket receives data, wake the process up. This approach is straightforward, and it is exactly the design idea behind select.

For ease of understanding, let's review how select is used. In the following code, an array fds holds all the sockets to monitor, and then select is called. If none of the sockets in fds has data, select blocks; when some socket receives data, select returns and wakes up the process. The user can then traverse fds, determine which sockets received data via FD_ISSET, and process the data.

```c
int s = socket(AF_INET, SOCK_STREAM, 0);
bind(s, ...);
listen(s, ...);

int fds[] = /* the sockets to monitor */;
while (1) {
    int n = select(..., fds, ...);
    for (int i = 0; i < fds.count; i++) {
        if (FD_ISSET(fds[i], ...)) {
            // process data on fds[i]
        }
    }
}
```

The flow of select

If the program monitors sock1, sock2, and sock3 at the same time, then after select is called, the operating system adds process A to the wait queue of each of the three sockets.

The operating system adds process A to the wait queues of the three sockets

When any of the sockets receives data, the interrupt routine wakes up the process. The following figure shows sock2 receiving data:

Note: recv and select may register different interrupt callbacks.

sock2 receives data, and the interrupt routine wakes up process A

Waking up the process means removing it from all the wait queues and adding it to the work queue, as shown in the following figure:

Remove process A from all wait queues and add it to the work queue

After these steps, when process A wakes up it knows only that at least one socket has received data. The program must then iterate over the socket list to find out which sockets are ready.

This simple approach works well and is implemented on almost every operating system.

But simple methods often have disadvantages, mainly:

First, every call to select must add the process to the wait queue of every monitored socket, and every wakeup must remove it from every one of those queues. That means two traversals, plus passing the entire fds list to the kernel on each call, which carries real overhead. Because this traversal is expensive, select caps the number of monitored sockets for efficiency; by default it can monitor only 1024 sockets.

Second, after the process is woken up, the program still does not know which sockets received data, and must traverse the list once more.

So: is there a way to reduce the traversals? Is there a way to save the ready sockets somewhere? These are exactly the two problems epoll solves.

A sidebar: this section describes only one case of select. When a program calls select, the kernel first iterates over the sockets; if one or more sockets already have data in their receive buffers, select returns directly without blocking. This is one reason select's return value can be greater than 1. Only if no socket has data does the process block.

6. The design ideas of epoll

epoll was invented many years after select, as an enhanced version of select and poll (poll is essentially the same as select, with minor improvements). epoll improves efficiency with the following measures.

Measure one: function separation

One reason select is inefficient is that the two steps "maintain the wait queues" and "block the process" are rolled into one. As shown in the figure below, every call to select performs both steps. Yet in most application scenarios, the set of sockets to monitor is relatively stable and does not need to change on every call. epoll separates the two operations: epoll_ctl maintains the wait queues, and epoll_wait blocks the process. The efficiency gain is obvious.

Let's look at how epoll is used, as preparation for what follows. epoll_create creates an epoll object epfd; epoll_ctl adds the sockets to monitor to epfd; epoll_wait waits for data:

```c
int s = socket(AF_INET, SOCK_STREAM, 0);
bind(s, ...);
listen(s, ...);

int epfd = epoll_create(...);
epoll_ctl(epfd, ...); // add all sockets to be monitored to epfd
while (1) {
    int n = epoll_wait(...);
    for (/* each socket that received data */) {
        // process data
    }
}
```

This separation of functions is what makes the following optimization of epoll possible.

Measure two: ready list

Another reason select is inefficient is that the program cannot know which sockets received data, so it must iterate over all of them. If the kernel maintains a "ready list" that references the sockets that have received data, this traversal can be avoided. As shown in the figure below, the computer has three sockets; sock2 and sock3, which have received data, are referenced by the ready list rdlist. When the process is woken up, it knows which sockets received data simply by reading rdlist.

Ready list schematic

7. The principle and workflow of epoll

This section uses examples and diagrams to illustrate how epoll works.

Create an epoll object

As shown in the figure below, when a process calls epoll_create, the kernel creates an eventpoll object (the object that epfd refers to in the program). The eventpoll object is also a member of the file system and, like a socket, has a wait queue.

The kernel creates the EventPoll object

Creating an eventpoll object to represent the epoll instance is necessary because the kernel must maintain data such as the ready list, which is conveniently kept as a member of eventpoll.

Maintain a watch list

After the epoll object is created, epoll_ctl can add or remove the sockets to monitor. Taking addition as an example: in the figure below, if sock1, sock2, and sock3 are monitored via epoll_ctl, the eventpoll object is added to the wait queues of all three sockets.

Add the socket to listen on

When a socket receives data, the interrupt routine now operates on the eventpoll object instead of operating directly on the process.

Receive data

When a socket receives data, the interrupt routine adds a reference to that socket to eventpoll's ready list. As shown in the figure below, after sock2 and sock3 receive data, the interrupt routine makes rdlist reference these two sockets.

Add a reference to the ready list

The eventpoll object acts as an intermediary between the sockets and the process. Receiving data on a socket no longer affects the process directly; instead it changes the process's state indirectly, by changing eventpoll's ready list.

When the program executes epoll_wait, it returns immediately if rdlist already references some socket; if rdlist is empty, the process is blocked.

Block and wake up the process

Suppose processes A and B are running on the computer, and at some point process A executes epoll_wait. As shown in the figure below, the kernel blocks process A by putting it on eventpoll's wait queue.

Epoll_wait blocks the process

When a socket receives data, the interrupt routine both modifies rdlist and wakes up the processes on eventpoll's wait queue; process A becomes runnable again (figure below). And thanks to rdlist, process A knows exactly which sockets changed.

Epoll Wakes up a process

8. Implementation details of epoll

At this point, readers should have a good grasp of the essence of epoll. But a few questions remain: what does the eventpoll data structure look like? What data structure should the ready list use? And what data structure should eventpoll use to manage the sockets added and removed via epoll_ctl?

As shown in the figure below, eventpoll contains members such as lock, mtx, wq (a wait queue), and rdllist; the two of interest here are rdllist (the ready list) and rbr (the red-black tree root).

Epoll Principle diagram, image source: In-depth Understanding of Nginx: Module Development and Architecture Analysis (2nd edition), Tao Hui

Ready list data structure

The ready list references the sockets that are ready, so the kernel must be able to insert into it quickly.

The program may also call epoll_ctl at any time to add or remove a monitored socket; if a removed socket is already on the ready list, it must be taken off the list as well. So the ready list should be a data structure that supports both fast insertion and fast deletion.

A doubly linked list is exactly such a data structure, and epoll uses one to implement the ready queue (the rdllist in the figure above).

Index structure

Since epoll separates "maintaining the watch list" from "blocking the process," there must also be a data structure that holds the monitored sockets. It should at least support fast addition and removal, and fast search, to avoid duplicate registrations. A red-black tree is a self-balancing binary search tree with O(log n) search, insertion, and deletion, which makes it efficient for this purpose; epoll uses a red-black tree as its index structure (the rbr in the figure above).

Note: because the operating system must support many features and store more bookkeeping data, rdllist does not reference sockets directly but indirectly, through epitem objects; the nodes of the red-black tree are also epitems. Likewise, the file system does not reference sockets directly. Some of these indirection layers have been omitted here for ease of understanding.

9. Conclusion

epoll builds on the ideas of select and poll, introducing eventpoll as an intermediate layer and using efficient data structures; it is a high-performance multiplexing technique. To conclude, here is a brief comparison of select, poll, and epoll:

|                        | select                        | poll                       | epoll                                  |
| ---------------------- | ----------------------------- | -------------------------- | -------------------------------------- |
| Passing fds to kernel  | fd_set copied on every call   | pollfd array on every call | registered once via epoll_ctl          |
| Max monitored sockets  | 1024 by default               | no hard limit              | no hard limit                          |
| Finding ready sockets  | traverse all fds, O(n)        | traverse all fds, O(n)     | read the ready list directly           |
| Kernel data structure  | none kept across calls        | none kept across calls     | red-black tree + ready list            |