Preface

In this section we introduce I/O multiplexing: a simple example demonstrates how it is used, and we explain how it is implemented. This technique is one of the foundations of today's high-performance servers, and the network programming models introduced later all rely on it.

After reading this article, you will know:

  • The execution flow of I/O multiplexing;

  • The usage, advantages and disadvantages of the select function, and its implementation principle;

  • The usage, advantages and disadvantages of the poll function, and its implementation principle;

  • The usage, advantages and disadvantages of the epoll function, and its implementation principle;

  • Level triggering and edge triggering in epoll, and how they are implemented.

1. Introduction to the I/O multiplexing model

I/O multiplexing means using a single system call to monitor multiple descriptors at the same time: the call blocks and waits until one or more descriptors become ready, then returns. Common system calls such as select, poll, and epoll implement this functionality. This model does not block on the real I/O system calls.

The working principle is shown in the figure below:

As shown in the figure above, instead of polling with non-blocking I/O to determine whether data is ready, this model delegates that check to a single select() system call.

select, poll, and epoll are commonly used to implement I/O multiplexing.

2. Select function

**select is a classic system call that implements I/O multiplexing.** select() can wait for more than one socket to become readable at the same time, and returns as soon as any of them is readable.

2.1. Select function definition

#include <sys/select.h>

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);

void FD_CLR(int fd, fd_set *set);
int  FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);

int pselect(int nfds, fd_set *readfds, fd_set *writefds,
            fd_set *exceptfds, const struct timespec *timeout,
            const sigset_t *sigmask);


Select function parameters:

  • int nfds: specifies the number of descriptors to be tested. Its value is the highest-numbered descriptor to be tested plus 1;

  • fd_set *readfds: specifies the descriptors the kernel should test for readability;

  • fd_set *writefds: specifies the descriptors the kernel should test for writability;

  • fd_set *exceptfds: specifies the descriptors the kernel should test for exceptional conditions;

  • struct timeval *timeout: tells the kernel how long to wait for any of the specified descriptors to become ready.

There is one important structure here: fd_set, which stores a set of descriptors. Internally it uses a bitmap to record descriptors.

There are four macros associated with it:

  • FD_CLR: clears the bit corresponding to descriptor fd in the fd_set;

  • FD_SET: sets the bit corresponding to descriptor fd in the fd_set;

  • FD_ZERO: clears all bits in the fd_set;

  • FD_ISSET: checks whether the bit corresponding to descriptor fd is set.
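
A small illustration of how these macros are typically used, with a hypothetical descriptor sockfd (not taken from the original example):

#include <sys/select.h>

void fd_set_demo(int sockfd) {
    fd_set rset;
    FD_ZERO(&rset);                 /* clear every bit in the set */
    FD_SET(sockfd, &rset);          /* turn on the bit for sockfd */
    if (FD_ISSET(sockfd, &rset))    /* test whether sockfd's bit is on */
        FD_CLR(sockfd, &rset);      /* turn the bit off again */
}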

2.2. Select function example

Here is an example of how select is used and how it works.

This example opens a listening socket, gets five client connections, uses the select function to determine if any data reaches the server, and if so, reads it. I’ve annotated it with detailed comments, but I’ll focus on the numbered code below:
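
Below is a minimal, self-contained sketch of that flow. The port number, buffer size, and the omission of error handling are assumptions of this sketch, not part of the original example:

#include <string.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
    char buf[1024];
    int fds[5], maxfd = 0;

    int listenfd = socket(AF_INET, SOCK_STREAM, 0);           /* 1. socket */

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                              /* port chosen arbitrarily */

    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));   /* 2. bind */
    listen(listenfd, 5);                                      /* 3. listen */

    for (int i = 0; i < 5; i++) {                             /* 4. accept 5 connections, record maxfd */
        fds[i] = accept(listenfd, NULL, NULL);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    fd_set rset;
    while (1) {
        FD_ZERO(&rset);                                       /* 5. re-initialize the fd_set each round */
        for (int i = 0; i < 5; i++)
            FD_SET(fds[i], &rset);

        select(maxfd + 1, &rset, NULL, NULL, NULL);           /* 6/7. max descriptor + 1, block until ready */

        for (int i = 0; i < 5; i++) {
            if (FD_ISSET(fds[i], &rset)) {                    /* 8. check which descriptors are readable */
                read(fds[i], buf, sizeof(buf));               /* handle the data here */
            }
        }
    }
}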

1. socket

Call socket to create a listening socket and obtain the listening socket descriptor;

2. bind

Call bind to assign the local protocol address to the socket;

3. listen

Call listen to convert the socket into a passive socket and begin accepting connection requests on it;

4. Get maxfd

While accepting the 5 connected sockets, keep track of the largest socket file descriptor obtained;

5. Initialize the fd_set

Inside the loop, rset needs to be reset before each new call to select. Step 7 explains why;

fd_set is a bitmap whose size is fixed by the kernel at 1024 bits, which limits us to monitoring at most 1024 descriptors at the same time. If the five descriptors here are 1, 2, 5, 6 and 8, the bitmap looks like this:
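
Conceptually, the bits for descriptors 1, 2, 5, 6 and 8 are set while all other bits stay 0:

bit index: 0  1  2  3  4  5  6  7  8  ...  1023
value:     0  1  1  0  0  1  1  0  1  ...  0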

6. Pass the maximum descriptor to be tested plus 1 to select

Why add 1 here?

As step 5 shows, the bitmap in fd_set starts at bit 0, so the effective length of rset is the largest descriptor to be tested plus 1.

7. select passes the descriptors to the kernel to test for readability, then blocks and waits for the kernel to return

This step goes like this:

  • After the application process calls select, fd_set is copied from user space to kernel space, and the application process blocks;

  • The kernel checks the descriptors recorded in fd_set, and updates fd_set according to whether each descriptor has data ready;

  • When the process wakes up and gets the fd_set processed by the kernel, it can use FD_ISSET to determine whether each socket's data is ready.

8. Determine whether the descriptor is readable

For each socket descriptor whose data is ready, the corresponding bit in the returned fd_set is set; FD_ISSET can then be used to check the result.

2.3 Advantages and disadvantages of the select function

2.3.1 Advantages

With non-blocking I/O, the application polls each descriptor itself to check whether data is ready, switching into kernel mode on every check. The select function hands the query over multiple descriptors to the kernel in a single call, which saves CPU and reduces user/kernel mode switches.

2.3.2 Disadvantages

From the above procedure description, we can know that select has the following disadvantages:

  • The bitmap in fd_set is fixed at 1024 bits, meaning that at most 1024 sockets can be monitored. You could change the kernel source code, but the cost is relatively high;

  • Each time fd_set is passed into the kernel it is overwritten and cannot be reused, so every call to select requires re-initializing the fd_set;

  • Every call to select needs to copy a new fd_set into kernel space;

  • After receiving the resulting fd_set, the application still needs to traverse the entire fd_set to know which file descriptors have data to process.

3. Poll function

To address select's shortcomings, a second system call, poll, emerged; it interacts with the kernel differently and removes the limit on the number of file descriptors.

3.1. Poll function definition

Here is the poll function definition:

#include <poll.h>

// Returns the number of ready descriptors, 0 on timeout, and -1 on error
int poll(struct pollfd *fds, nfds_t nfds, int timeout);

The poll function arguments:

  • struct pollfd *fds: a pointer to the first element of an array of structures. Each element is a pollfd structure that specifies the test conditions for a given descriptor fd:

struct pollfd {
    int   fd;      /* File descriptor to be checked */
    short events;  /* Event types to be detected on the descriptor */
    short revents; /* Events that actually occurred on the descriptor, set on return */
};

  • int fd: the file descriptor to be tested;

  • short events: the event types to be tested on the descriptor. A short is used here, and specific events are selected by OR-ing binary mask bits. Commonly used event types include POLLIN (data can be read), POLLOUT (data can be written without blocking), POLLERR (an error occurred) and POLLHUP (hang-up);

  • short revents: the events that actually occurred on the descriptor; the kernel fills this field in before poll returns;

  • nfds_t nfds: specifies the number of elements in the fds array;

  • int timeout: specifies how long the poll function should wait before returning.

Now let’s look at specific examples.

3.2 examples of poll functions

Here is an example of how poll is used and how it works.

Similar to the select example, a listener socket is opened, and five client connections are obtained. The poll function determines whether any data reaches the server, and if so, it reads the data.
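
Below is a minimal sketch of that flow. As in the select sketch, the port number, the 10-second timeout, and the omission of error handling are assumptions:

#include <string.h>
#include <unistd.h>
#include <poll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
    char buf[1024];
    struct pollfd pollfds[5];

    int listenfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                         /* port chosen arbitrarily */
    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listenfd, 5);

    for (int i = 0; i < 5; i++) {
        pollfds[i].fd = accept(listenfd, NULL, NULL);    /* 1. set the pollfd descriptor */
        pollfds[i].events = POLLIN;                      /* 2. ask to be notified when data can be read */
    }

    while (1) {
        poll(pollfds, 5, 10000);                         /* 3. block for up to 10 seconds */

        for (int i = 0; i < 5; i++) {
            if (pollfds[i].revents & POLLIN) {           /* 4. POLLIN in revents means data is ready */
                pollfds[i].revents = 0;                  /* 5. reset revents so the pollfd can be reused */
                read(pollfds[i].fd, buf, sizeof(buf));   /* handle the data here */
            }
        }
    }
}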

1. Set the pollfd descriptors

Here, accept blocks to obtain each connected descriptor and assigns it to the fd field of a pollfd structure.

2. Set the pollfd events

Then set POLLIN in the events field of each pollfd, specifying that we want to detect POLLIN, that is, readable data.

3. poll

Call the poll function and pass in the 5 pollfds just initialized, with a timeout of 10 seconds. The call blocks here until the kernel returns.

Similar to select, this step is executed like this:

  • After poll is called, the pollfd array is copied from user space to kernel space, and the application process blocks;

  • The kernel obtains the descriptors from the fd fields of the pollfds and sets each pollfd's revents according to whether the descriptor has data ready;

  • When the process wakes up and gets the pollfds processed by the kernel, it can check whether the corresponding event bit is set in revents to know whether the socket data is ready.

4. Determine whether the event is ready

After returning from the kernel, we loop over the revents of each element in the pollfds and check whether POLLIN has been set, which indicates that the data is ready.

5. Reset revents

Once revents has been reset, the pollfds can be reused the next time the poll function is called.

3.3 Advantages and disadvantages of the poll function

3.3.1 Advantages

  • Like select, poll hands the query over multiple descriptors to the kernel in a single call, instead of polling each descriptor with non-blocking I/O, where every check switches into kernel mode and consumes CPU;

  • Instead of a bitmap, an array of pollfd structures is used directly, so there is no 1024-descriptor limit;

  • Since the kernel only modifies the revents field of the pollfd structure, the pollfd array can be reused by resetting revents after each read, rather than re-initializing a fresh rset on every call as select does.

3.3.2 Disadvantages

  • Each call to poll still copies a new pollfd array into kernel space, which involves a user-to-kernel switch;

  • After receiving the resulting pollfds, the application still needs to traverse the entire array to know which file descriptors have data to process.

4. Epoll

Unlike poll, epoll is not itself a single system call, but a kernel data structure that allows a process to multiplex I/O over multiple file descriptors.

There are three system calls to create, modify, and delete this data structure.

4.1 Epoll-related functions

4.1.1 epoll_create

An epoll instance is created by the epoll_create system call, which returns a file descriptor referring to the epoll instance. The function is defined as follows:

#include <sys/epoll.h>

int epoll_create(int size);


The size parameter tells the kernel the number of file descriptors the process expects to monitor, which helps the kernel decide the initial size of the epoll instance. Starting with Linux 2.6.8, this parameter is ignored because the epoll data structure resizes dynamically as file descriptors are added or removed.

The epoll_create system call returns the file descriptor of the newly created epoll kernel data structure. The calling process can then use this file descriptor to add, remove, or modify the other file descriptors whose I/O it wants to monitor with this epoll instance.

The user process ends up holding the epoll instance descriptor, epfd, through which it accesses the epoll instance:

There is another system call, epoll_create1, which is defined as follows:

int epoll_create1(int flags);


The flags argument can be 0 or EPOLL_CLOEXEC.

  • If flags is 0, epoll_create1 behaves the same as epoll_create;

  • If the EPOLL_CLOEXEC flag is set, the close-on-exec flag is set on the returned descriptor, so the epoll descriptor is closed automatically when a child process executes a new program, and that child can no longer access the epoll instance.
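
A minimal sketch of creating an epoll instance with epoll_create1; the error handling shown is only illustrative:

#include <stdio.h>
#include <sys/epoll.h>

int main(void) {
    /* the descriptor is closed automatically when the process execs a new program */
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd == -1) {
        perror("epoll_create1");
        return 1;
    }
    /* the older interface is equivalent here: epoll_create(1); its size argument
       must be positive but is otherwise ignored on modern kernels */
    return 0;
}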

4.1.2 epoll_ctl

A process can add the file descriptor it wants to monitor to an epoll instance by calling epoll_ctl.

All file descriptors registered with an epoll instance are collectively called the epoll interest list [1]; each is wrapped into an epitem structure and inserted into the red-black tree rbr:

In the figure above, the user process registers the file descriptors fd1, fd2, fd3 and fd4 with the epoll instance; together they form the interest list of that epoll instance.

When any registered file descriptor becomes ready for I/O, it is put into the event-ready queue. The event-ready queue is a subset of the interest list: when the kernel receives an I/O-ready callback, it adds the corresponding epitem from the rbr to the event-ready queue.

The epoll_ctl system call is defined as follows:

#include <sys/epoll.h>

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

  • int epfd: the file descriptor returned by epoll_create, identifying the epoll instance in the kernel;

  • int op: the operation to be performed on file descriptor fd. Three operations are generally supported:

  • EPOLL_CTL_ADD: register the event corresponding to the file descriptor with the epoll instance;

  • EPOLL_CTL_DEL: unregister the fd from the epoll instance. This means the process will no longer get any notifications about events on that file descriptor. If a file descriptor has been added to multiple epoll instances, closing it removes it from all of the epoll interest lists it was added to;

  • EPOLL_CTL_MOD: modify the event corresponding to the file descriptor;

  • int fd: the file descriptor we want to add to the epoll interest list;

  • struct epoll_event *event: a pointer to an epoll_event structure that stores the events we actually want to monitor on fd.

typedef union epoll_data {
    void     *ptr;
    int       fd;
    uint32_t  u32;
    uint64_t  u64;
} epoll_data_t;

struct epoll_event {
    uint32_t     events;  /* Epoll events */
    epoll_data_t data;    /* User data variable */
};

4.1.3 epoll_wait

The epoll_wait system call waits for the kernel to notify the process of events on the epoll instance's interest list; it blocks until any of the monitored descriptors is ready for I/O.

The function is defined as follows:

#include <sys/epoll.h>

int epoll_wait(int epfd, struct epoll_event *events,
               int maxevents, int timeout);

  • int epfd: the epoll instance descriptor;

  • struct epoll_event *events: an array used to return the ready I/O events to user space;

  • int maxevents: the maximum number of events epoll_wait may return;

  • int timeout: the blocking timeout in milliseconds; -1 means wait indefinitely, and 0 means return immediately.

4.2 Epoll Example

Note: the epoll mechanism was introduced in Linux 2.6, so macOS does not support it; on macOS, kqueue takes the place of epoll for I/O multiplexing.
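
Below is a minimal sketch following the six steps described next. The port number, MAX_EVENTS value, and the omission of error handling are assumptions of this sketch:

#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define MAX_EVENTS 5

int main(void) {
    char buf[1024];
    struct epoll_event ev, events[MAX_EVENTS];        /* 1. epoll_event for registration and for results */

    int listenfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                      /* port chosen arbitrarily */
    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listenfd, 5);

    int epfd = epoll_create1(0);                      /* 2. create the epoll instance and get epfd */

    for (int i = 0; i < 5; i++) {
        int connfd = accept(listenfd, NULL, NULL);
        ev.data.fd = connfd;                          /* 3. the descriptor we actually want to monitor */
        ev.events  = EPOLLIN;                         /* 4. the event we want to be notified about */
        epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);  /* 5. add it to the epoll interest list */
    }

    while (1) {
        /* 6. block until events are ready; the return value is the size of the ready queue */
        int nready = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < nready; i++) {
            read(events[i].data.fd, buf, sizeof(buf));   /* handle the data here */
        }
    }
}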

1. Define epoll_event

Define epoll_event variables that store the information about the fds we actually want to monitor, plus an array to receive the list of ready events from epoll_wait.

2. Call epoll_create

Create an epoll instance in the kernel and obtain the epfd file descriptor.

3. Set event.data.fd

Set the actual descriptor to be monitored in the event structure.

4. Set event.events

Set the actual events to be monitored in event.events.

5. Call epoll_ctl

Add the descriptor and the events to be monitored to the epoll instance.

6. Call epoll_wait

Get the events that have occurred on the interest list of the kernel epoll instance, that is, the contents of the event-ready queue. epoll_wait returns the size of the ready queue.

4.3 Epoll principle analysis[2]

Look at that picture again:

Here we focus on the following related constructs of epoll instances:

eventpoll: the epoll instance
rdllist:   the event-ready queue
rbr:       the red-black tree used to quickly find fds
epitem:    an epitem is created for each fd
  • Each time the user process calls epoll_wait, the kernel uses the epfd descriptor to find the eventpoll instance the user wants to access;

  • Each time epoll_ctl is called to subscribe to events for a descriptor, the descriptor and its event-related content are wrapped into an epitem structure, which is added as a node to the red-black tree rbr, where all the descriptors the user process is interested in are stored;

  • Through the ep_ptable_queue_proc function, the kernel sets the ep_poll_callback for each file descriptor. When an event occurs on the corresponding file descriptor, the callback is invoked, which makes the kernel search the red-black tree and add the epitem of the affected socket to the event-ready queue;

  • Finally, when epoll_wait copies the contents of the event-ready queue from kernel space to user space, the poll method of each file descriptor is called again to confirm that the event actually occurred, which ensures the event is still valid.

For more implementation details, read the epoll source code [2].

4.4 Edge triggering and level triggering

Let's look at the difference between edge triggering and level triggering.

4.4.1 Edge triggering

When the epoll_event added to an epoll instance is set to EPOLLET (edge-triggered), epoll_wait returns the corresponding epoll_event to the application process when a new ready event occurs on the descriptor. In edge-triggered mode, the epoll_event of a ready descriptor is returned only once, meaning the program gets only one chance to process it.

4.4.2 Level triggering

When the epoll_event added to an epoll instance is level-triggered (the default mode, used whenever EPOLLET is not set), epoll_wait keeps returning the descriptor to the application process on subsequent calls as long as the ready descriptor has not been fully processed.

Edge triggering with EPOLLET is more efficient than level triggering, because the application process is notified only once for each ready socket, but it requires the programmer to process the socket completely on that single notification, since there is no second chance to compensate. A sketch of typical edge-triggered usage follows.
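
A minimal sketch of how edge-triggered registration is usually done (the helper names here are illustrative, not from the original article): the socket is made non-blocking, EPOLLET is added to the events, and after each notification the socket is read until EAGAIN so no data is left behind:

#include <unistd.h>
#include <fcntl.h>
#include <sys/epoll.h>

/* register a connected socket for edge-triggered reads */
static void register_edge_triggered(int epfd, int connfd) {
    struct epoll_event ev;

    /* edge-triggered sockets are usually made non-blocking first */
    int flags = fcntl(connfd, F_GETFL, 0);
    fcntl(connfd, F_SETFL, flags | O_NONBLOCK);

    ev.data.fd = connfd;
    ev.events  = EPOLLIN | EPOLLET;   /* without EPOLLET the default is level-triggered */
    epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);
}

/* after epoll_wait reports readiness, read until EAGAIN so nothing is left unprocessed */
static void drain(int connfd) {
    char buf[1024];
    ssize_t n;
    while ((n = read(connfd, buf, sizeof(buf))) > 0) {
        /* handle n bytes here */
    }
}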

4.4.3 Implementation principle

For level triggering, descriptors that have already been returned to user space are added back onto the ready queue; the next time epoll_wait is called, these epitems are processed again: the poll method of the file descriptor is called to check whether the event is still valid, and if it is, the epitem is returned once more. This is how level triggering is implemented.

In the case of edge triggering, a descriptor that has been returned to user space is not placed back onto the ready queue, so it is returned only once.

4.5 Advantages and disadvantages of epoll

Advantages

  • When epoll_wait is called, no descriptor structure is copied into kernel space the way it is with poll. Instead, the kernel's own epoll instance structure is reused and referenced through epfd, which reduces system overhead;

  • Under the hood, as soon as an event occurs, the socket's callback notifies the epoll instance immediately, so the event-ready queue is prepared as early as possible and epoll_wait can return correspondingly faster;

  • epoll maintains the interest list in a red-black tree, so each time a new event triggers a callback on a socket, that socket's epitem can be found quickly for subsequent processing;

  • Provides the higher-performance edge-triggering mechanism.

Because epoll has so many advantages, many technologies are implemented based on epoll, such as Nginx, Redis, and Java NIO under Linux.

Disadvantages

It is still not truly asynchronous I/O: data is copied from the kernel to the application process only when the application process itself calls the I/O functions.