Where does IO multiplexing come from

First came BIO

Before NIO, clients communicated with servers via BIO (blocking I/O). In the BIO communication model, an independent Acceptor thread listens for client connections. After receiving a connection request, it creates a new thread for each client to handle the connection; when processing completes, the thread returns the response to the client through the output stream and is then destroyed. It is a typical one-request-one-reply communication model.

The C10K problem

The Internet is built on network communication, and the early Internet was essentially a collection of small groups: it was not widespread, and there were not many users. At that time, a server with 100 concurrent users was considered a large application, so there was no C10K problem. The explosion of the Internet came roughly with WWW sites, browsers, and Yahoo, in the era known as Web 1.0, when most usage consisted of downloading an HTML page and viewing its contents in the browser; there was still no C10K problem in this period. The Web 2.0 era changed that. On one hand, penetration rose sharply and the user base grew geometrically; on the other hand, the Internet was no longer just about browsing the Web but increasingly about interaction, and application logic grew more complex, from simple form submission to instant messaging and online real-time interaction. That is where the C10K problem, serving ten thousand concurrent connections on a single machine, came to the surface.

Then NIO and IO multiplexing

Using BIO to solve the C10K problem means one thread per connection, but threads are "expensive" resources, mainly in the following respects:

  1. Threads are expensive to create and destroy. In an operating system like Linux, a thread is essentially a lightweight process, and creating and destroying one are heavyweight system operations.
  2. Threads themselves occupy a large amount of memory. A Java thread stack, for example, generally takes at least 512KB to 1MB, so with more than a thousand threads the stacks alone may eat up half of the JVM's memory.
  3. Thread switching is costly. When the operating system switches threads, it must save the thread's context and then execute the switch. If there are too many threads, switching time may even exceed the threads' own execution time; the symptom is typically an excessively high CPU sy (system) usage, over 20%, leaving the system in a nearly unavailable state.
  4. Jagged system load. System load depends on the number of active threads relative to CPU cores; with many threads and an unstable external network, a burst of request results returning at the same time can wake a large number of blocked threads at once and spike the load.

NIO (non-blocking I/O, also known as New I/O in the Java world) is a synchronous non-blocking I/O model that underlies I/O multiplexing. I/O multiplexing, in a nutshell, means one thread or process monitors several file descriptors (the "multiple ways") at once to see whether any of them can perform I/O. In network programming, either multithreading or I/O multiplexing can be used to handle multiple client requests at the same time; multiplexing lets a single thread serve multiple clients by funnelling the blocking for many descriptors into one blocking call such as select. Compared with the traditional multithreading/multi-process model, its biggest advantage is low system cost: the system does not need to create or maintain additional processes or threads, which reduces maintenance work and saves system resources.

The advantage of I/O multiplexing is not that individual connections are processed faster, but that more connections can be processed. A Web server using select/epoll does not necessarily outperform one using multithreading + blocking I/O, and may even have higher latency when the number of connections being handled is small.

So how does I/O multiplexing solve the C10K problem?

Background knowledge

  1. File descriptor

  • Inode: an inode records a file's metadata, including its creator, creation time, size, owner user ID, group ID, rwx permissions, link count (how many file names point to the inode), and the location of the file's data blocks.

  • Block: a file is stored on disk in sectors, the smallest storage unit, each 512 bytes (0.5KB). Reading sector by sector is inefficient, so the operating system reads a block of several consecutive sectors at a time (commonly 4KB, i.e. eight sectors); a block is the smallest unit of file access.
  • Opening a file: each inode has its own inode number, and the system internally identifies files by inode number, not file name. Opening a file actually proceeds like this: the system finds the inode number corresponding to the file name, locates the inode by that number, finds the blocks containing the file's data from the inode information, and then reads the data.

After each process starts, a section of memory is created to hold the process's file descriptor table; a file descriptor can be considered an index into that table. To maintain file descriptors, the system keeps three tables: the process-level file descriptor table, the system-level open file table, and the inode table.

  1. File descriptor table: records information about a single file descriptor of one process.
  2. Open file table: each entry is called an open file handle and describes all information about an open file.
  3. Inode table: each file system maintains an inode table for all files (and directories) stored on it.

To deepen understanding, here is the I/O operation process. When an I/O operation is performed, the fd is passed as a parameter (an index into the file descriptor table); the file pointer there leads to the corresponding open file table entry, whose inode pointer leads to the file's inode, which finally gives the real location of the file data for the I/O operation. Communication through sockets is likewise reading and writing through file descriptors, so socket and fd are used as synonyms below: passing an fd_set to a function, for example, can be imagined as a set of sockets waiting for data. Further reading: What are Linux processes, threads, and file descriptors.
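The lookup chain described above (fd → open file table entry → inode → data blocks) is what every read traverses. A minimal sketch in C (the helper name `read_file` is mine, not a standard API):

```c
#include <fcntl.h>
#include <unistd.h>

/* Open a file by name and read up to cap bytes through the returned
   descriptor. Returns the number of bytes read, or -1 on error. */
ssize_t read_file(const char *path, char *buf, size_t cap) {
    int fd = open(path, O_RDONLY); /* fd: index into this process's file descriptor table */
    if (fd < 0)
        return -1;
    /* the kernel follows fd -> open file table entry -> inode -> data blocks */
    ssize_t n = read(fd, buf, cap);
    close(fd); /* release the descriptor table slot */
    return n;
}
```

Note that the application only ever sees the small integer fd; all three kernel tables are consulted behind the scenes on each `read`.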

  2. System interrupts

What is a system interrupt?

“Without interrupts, the operating system can do almost nothing; the operating system is interrupt-driven.” When the CPU learns that something has happened in the computer, it pauses the program it is executing and instead executes the program that handles that event; when that program finishes, the CPU resumes the previous program. The whole process is called interrupt handling, or simply an interrupt. Interrupts include external interrupts (hardware interrupts) and internal interrupts (software interrupts and exceptions). For external interrupts, the CPU provides two signal lines: external hardware notifies the CPU through INTR (INTeRrupt) and NMI (Non-Maskable Interrupt). An internal software interrupt is initiated by the software itself, hence the name soft interrupt; because it is raised deliberately by the software, it is subjective rather than an objective internal error.

What does the kernel do when an interrupt occurs?

Take the network card as an example.

  1. The NIC receives the packet and copies it into the NIC buffer in memory via DMA.
  2. The NIC issues an IRQ (Interrupt ReQuest) to the CPU.
  3. If the CPU is executing process A, it first saves the process state into the process descriptor.
  4. The CPU switches from user mode to kernel mode.
  5. The CPU executes the interrupt handler for the NIC.
  6. The CPU switches from kernel mode back to user mode.
  7. The CPU restores process A's context from the process descriptor.
  3. System calls

What are user mode and kernel mode?

Before discussing user mode and kernel mode, it helps to talk about the CPU instruction set, the medium through which software commands hardware: each assembly statement corresponds to a CPU instruction, and a great many CPU instructions together form one or more sets, called CPU instruction sets. Instruction sets are privilege-classified. Consider why: CPU instructions can operate hardware directly, so a badly behaved instruction can damage the whole computer system. If you wrote a program and, being unfamiliar with hardware operation, made a mistake, the operating system kernel and other running programs could be irreparably corrupted, and the only remedy would be a reboot. Moreover, operating hardware is complex, parameter-heavy, and error-prone, a tough and burdensome task for developers, who for this purpose also cannot be fully trusted; so the operating system kernel shields developers from direct hardware operation and does not let them touch the privileged parts of the CPU instruction set at all. To support this, hardware vendors provide hardware-level enforcement by attaching privilege levels to the CPU instruction set: different privilege levels may use different subsets of the instructions. Taking Intel CPUs as an example, Intel divides instruction privilege into four levels, from highest to lowest:

  • ring 0
  • ring 1
  • ring 2
  • ring 3

Ring 0 has the highest privilege and can use the entire CPU instruction set, while Ring 3 has the lowest and can use only the conventional instructions; instructions that operate hardware resources, such as I/O reads and writes, NIC access, and memory allocation, are not available to it. The Linux system uses only Ring 0 and Ring 3. In plain terms:

  • Ring 0 is called kernel mode and runs entirely within the operating system kernel
  • Ring 3 is called user mode and runs in applications

More precisely:

  • Kernel-space code executes at protection level Ring 0, has full control over the hardware, can execute the entire CPU instruction set, and can access memory at any address. Any exception in kernel mode is catastrophic and will halt the whole machine.
  • User-mode code executes at protection level Ring 3 and has no direct control over hardware or access to arbitrary memory addresses; applications reach hardware and memory through system call APIs. In this protected mode, even if an application crashes, the system can recover. Most programs on a computer run in user mode.

When does a switch between user mode and kernel mode take place?

  1. System calls, which in Linux are the only means for user space to enter the kernel actively
  2. Interrupts
  3. Exceptions
  4. Socket basics

Golang socket programming

Server side:

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

func main() {
	serverSocket, err := net.Listen("tcp", "127.0.0.1:9090")
	if err != nil {
		fmt.Printf("serverSocket failed, err: %+v\n", err)
		return
	}
	for {
		socket, err := serverSocket.Accept()
		if err != nil {
			fmt.Printf("accept failed, err: %+v\n", err)
			continue
		}
		go process(socket)
	}
}

func process(socket net.Conn) {
	defer func() { _ = socket.Close() }()
	reader := bufio.NewReader(socket)
	for {
		var buf [128]byte
		n, err := reader.Read(buf[:])
		if err != nil {
			break
		}
		fmt.Printf("received data: %+v\n", string(buf[:n]))
		_, _ = socket.Write([]byte("ok"))
	}
}
```

Client side:

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"strings"
)

func main() {
	// establish a connection with the server
	socket, err := net.Dial("tcp", "127.0.0.1:9090")
	if err != nil {
		fmt.Printf("dial to server failed, err: %v\n", err)
		return
	}
	defer func() { _ = socket.Close() }()
	input := bufio.NewReader(os.Stdin)
	for {
		s, _ := input.ReadString('\n')
		s = strings.TrimSpace(s)
		if strings.ToUpper(s) == "Q" {
			return
		}
		_, err = socket.Write([]byte(s))
		if err != nil {
			fmt.Printf("send failed, err: %v\n", err)
			return
		}
		var buf [1024]byte
		n, err := socket.Read(buf[:])
		if err != nil {
			fmt.Printf("read failed, err: %v\n", err)
			return
		}
		fmt.Printf("received: %v\n", string(buf[:n]))
	}
}
```

Working mechanism of socket read/write buffer

When the app sends data to the peer, read()/recv() and write()/send() are system calls. The path looks like this:

  1. The application makes a system call using write()/send() provided by the operating system; the data to be sent is copied from user space into the kernel.
  2. write()/send() does not immediately transfer the data to the network; it first writes the data into the socket's send buffer.
  3. The TCP/IP implementation in the OS sends data from the buffer to the target machine. Once the data has been written into the buffer, the functions can return successfully; whether and when it reaches the target machine is TCP's responsibility, not theirs.
  4. TCP acts independently of the write()/send() functions: data may be sent to the network as soon as it is written into the buffer, or it may accumulate in the buffer and be sent all at once, depending on the network condition at the time, whether the current thread is idle, and many other factors.
  5. When the peer's NIC receives the data, it triggers an interrupt; the kernel's interrupt handling moves the data into the socket's receive (input) buffer, from which the read()/recv() function reads it.
  6. The data is copied back into user space on the receiving end and rendered to the user by the upper-layer application.
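The copy path above (user buffer → kernel send buffer → kernel receive buffer → user buffer) can be exercised in a few lines. A hedged sketch, using a Unix-domain socketpair to stand in for a TCP connection; the helper name `echo_through_kernel` is mine:

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send msg through a pair of connected sockets and read it back on the
   other end. write() returns as soon as the bytes sit in the kernel send
   buffer; read() copies them back out of the peer's receive buffer.
   Returns the number of bytes read, or -1 on error. */
ssize_t echo_through_kernel(const char *msg, char *out, size_t cap) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    /* user buffer -> kernel send buffer */
    if (write(sv[0], msg, strlen(msg)) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    /* kernel receive buffer -> user buffer */
    ssize_t n = read(sv[1], out, cap);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

The point of the demo: `write` succeeds and returns before the peer ever calls `read`, because success only means "copied into the kernel buffer".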

By default, TCP sockets are blocking. When write()/send() is used to send data:

  1. The send buffer is checked first; if its available space is smaller than the data to be sent, write()/send() blocks (is suspended) until data in the buffer has been sent to the target machine and enough space is freed, at which point write()/send() is woken up to continue writing.
  2. If TCP is currently sending data to the network, the output buffer is locked, and write()/send() blocks until the buffer is unlocked and the call is woken up.
  3. If the data to be written is larger than the maximum length of the buffer, it is written in batches.
  4. write()/send() does not return until all the data has been written into the buffer.

When reading data using read()/recv():

  1. The receive buffer is checked first; if there is data in it, that data is read, otherwise the function blocks until data arrives from the network.
  2. If the amount of data to be read is smaller than the amount in the buffer, not all buffered data can be taken at once; the remainder stays in the buffer and accumulates until read()/recv() is called again.
  3. read()/recv() does not return until it has read data; otherwise it stays blocked.

This is the blocking mode of TCP sockets. "Blocking" means that until the previous action completes, the next one is suspended, keeping everything synchronous. Non-blocking mode is simpler: when write()/send() is called while the output buffer is full, or read()/recv() is called while the input buffer is empty, the call does not suspend the program flow but returns immediately with an error (return value -1, error code EWOULDBLOCK).

Now for the main course

BIO Underlying communication principles

  • The CPU runs processes A, B, C, and D from its run queue (along with kernel threads, of course)
  • Process A performs a blocking read on a Socket that has no data

  • Process A enters the blocked state, leaves the run queue, and joins the Socket's wait queue

  • The peer sends data to the server's network card
  • The NIC stores the packet into RAM via DMA
  • The CPU is not involved in this step

  • The NIC raises a hardware interrupt (IRQ)

  • The CPU, currently executing process B, responds to the interrupt
  • It saves process B's user stack information into the process descriptor
  • Switching from user mode to kernel mode means modifying CPU registers so the stack pointer points to process B's kernel stack
  • The appropriate interrupt handler is found in the interrupt vector table according to the IRQ vector
  • The interrupt handler for the NIC is executed
  • So the process that appears to execute the interrupt is process B, even though the interrupt has nothing to do with process B

  • The interrupt handler fetches the packet from the NIC buffer in RAM
  • The socket port is parsed from the packet, and the packet is transferred into that socket's read buffer
  • Process A is sitting in this socket's wait queue, so process A is removed from the wait queue

  • Process A returns to the CPU run queue and gets a chance to obtain a CPU time slice

Select

Function signature:

int select(int nfds,
            fd_set *restrict readfds,
            fd_set *restrict writefds,
            fd_set *restrict errorfds,
            struct timeval *restrict timeout);

The arguments passed to the select function tell the kernel:

  1. Which file descriptors we care about (three categories: read, write, exception)
  2. Which state we care about for each descriptor
  3. How long we are willing to wait

fd_set is a bitmap structure whose length defaults to 1024 bits, defined by the FD_SETSIZE constant in the kernel, so select can handle file descriptors only up to 1024; changing this limit requires recompiling the kernel. Using fd_set involves the following APIs:

```c
#include <sys/select.h>

void FD_ZERO(fd_set *fdset);          /* clear all bits in fdset */
void FD_CLR(int fd, fd_set *fdset);   /* clear the bit for fd */
void FD_SET(int fd, fd_set *fdset);   /* set the bit for fd */
int  FD_ISSET(int fd, fd_set *fdset); /* check whether the bit for fd is set */
```

After returning from select, the kernel tells us the following:

  1. The number of descriptors that are ready according to our requirements
  2. Which descriptors are ready for each of the three conditions (read, write, exception)

With this information, we can call the appropriate I/O functions, and those calls will no longer block.

Disadvantages of select

  • The bitmap in fd_set is fixed at 1024 bits, meaning a maximum of 1024 sockets can be monitored; the kernel source can be changed, but at considerable cost.
  • fd_set is overwritten by the kernel on return and cannot be reused; every call to select requires reinitializing it.
  • Every call to select copies a fresh fd_set into kernel space.
  • After getting the resulting fd_set back, the application has to traverse the entire set to learn which file descriptors have data to process.

Poll

poll and select are almost indistinguishable. poll stores the file descriptors to watch in a caller-supplied array of pollfd structures rather than a fixed-size bitmap, so there is no 1024 limit on how many can be watched. In terms of performance overhead, poll and select differ very little.
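The same wait-for-readable logic looks like this with poll; note there is no nfds+1 bookkeeping and no bitmap to rebuild, just an array entry per descriptor (the helper name `poll_readable` is mine):

```c
#include <poll.h>

/* poll-based equivalent: wait until fd is readable or timeout_ms elapses.
   Returns 1 if readable, 0 on timeout, -1 on error. */
int poll_readable(int fd, int timeout_ms) {
    /* one entry; the array can grow arbitrarily, with no 1024-bit cap */
    struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n = poll(&p, 1, timeout_ms);
    if (n < 0)
        return -1;
    /* the kernel reports results in revents, leaving events intact,
       so the same array can be reused across calls */
    return (p.revents & POLLIN) ? 1 : 0;
}
```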

epoll

epoll is an improvement over select and poll that avoids the two disadvantages of "high performance overhead" and "limited number of file descriptors". In short, epoll has the following characteristics:

  • It uses a red-black tree to store the set of monitored file descriptors
  • It uses a queue to store the ready file descriptors
  • Each file descriptor is passed to the kernel only once, when it is added; subsequent state changes are delivered through events

The select and poll models each use a single function, while the epoll model uses three: epoll_create, epoll_ctl, and epoll_wait, described separately below.

epoll_create

int epoll_create(int size);

epoll_create creates an epoll instance and returns a file descriptor referring to it. The returned descriptor only points to the corresponding epoll instance; it does not represent a real disk file node. Other APIs such as epoll_ctl and epoll_wait use this file descriptor to operate on the instance. Creating an epoll handle occupies an fd value, visible under /proc/<pid>/fd/ on Linux, so after using epoll you must call close(epfd), or fds may eventually be exhausted. When all file descriptors pointing to the same epoll instance have been closed, the operating system destroys the instance. Internally, an epoll instance stores:

  • Listener list: all file descriptors being monitored, stored in a red-black tree
  • Ready list: all ready file descriptors, stored in a linked list

epoll_ctl

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

epoll_ctl registers interest in the event event occurring on file descriptor fd. Parameters:

  • epfd is the file descriptor returned by epoll_create, pointing to an epoll instance

  • fd is the target file descriptor to monitor

  • event describes the event to monitor (readable, writable, send error…)

  • op is the operation to perform on fd:

    • EPOLL_CTL_ADD: add a listening event for fd
    • EPOLL_CTL_MOD: change the event associated with the target file descriptor fd
    • EPOLL_CTL_DEL: remove all listening events for fd; in this case the event parameter is ignored

The return value 0 or -1 indicates whether the operation succeeded. epoll_ctl adds the file descriptor fd to the epoll instance's listener list, sets a callback function for fd, and listens for the event. When an event occurs on the fd, the callback is invoked and adds the fd to the epoll instance's ready queue.

epoll_wait

int epoll_wait(int epfd, struct epoll_event *events,
               int maxevents, int timeout);

This is the main function of the epoll model, equivalent to select. Parameters:

  • epfd is the file descriptor returned by epoll_create, pointing to an epoll instance
  • events is an array, allocated by the caller, that receives the ready file descriptors
  • maxevents specifies the size of events
  • timeout is similar to timeout in select: if no file descriptor is ready (the ready queue is empty), epoll_wait blocks for timeout milliseconds; with timeout set to -1 it blocks until some descriptor is ready; with timeout set to 0 it returns immediately

The return value is the number of ready descriptors stored into events, at most maxevents.

Advantages of epoll

epoll is an improvement over select and poll that avoids the drawbacks of "high performance overhead" and "limited number of file descriptors". Regarding the descriptor limit: select stores the descriptor set in a fixed-size fd_set bitmap, while epoll stores it in a red-black tree with no hard limit. Regarding performance overhead: epoll_ctl attaches a callback to each file descriptor that adds it to the ready list when it becomes ready, so epoll does not need to iterate over every descriptor the way select does; it only has to check whether the ready list is empty. This also lets epoll yield system resources earlier when no descriptor is ready.

So the time complexity goes from O(n) to O(1)

In addition, every call to select must copy the entire set of monitored descriptors into the kernel, whereas with epoll each descriptor is passed only once, via epoll_ctl; epoll_wait does not need to pass them again. This too greatly improves efficiency.
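The three calls fit together as shown below. A minimal, Linux-only sketch (the helper name `epoll_wait_readable` is mine); a real server would keep one epoll instance alive across many epoll_wait calls rather than creating one per call:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd for read events on a fresh epoll instance and wait once.
   Returns the number of ready descriptors (0 on timeout), -1 on error. */
int epoll_wait_readable(int fd, int timeout_ms) {
    int epfd = epoll_create1(0); /* create the epoll instance */
    if (epfd < 0)
        return -1;

    struct epoll_event ev = {0};
    ev.events = EPOLLIN; /* interested in readability */
    ev.data.fd = fd;     /* echoed back to us when the event fires */
    /* fd is copied into the kernel's red-black tree exactly once, here */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    struct epoll_event ready[1];
    /* blocks up to timeout_ms; only already-ready fds are returned */
    int n = epoll_wait(epfd, ready, 1, timeout_ms);
    close(epfd); /* always release the epoll fd, as noted above */
    return n;
}
```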

Level trigger, edge trigger

LT (Level Trigger): a notification is triggered whenever the file descriptor is ready; if the user program does not read/write all the data in one pass, the readable/writable notification fires again next time.

ET (Edge Trigger): a notification fires once at the moment a descriptor becomes ready, and not again. select supports only level triggering; epoll supports both level triggering and edge triggering.

Difference: edge triggering is more efficient, since it reduces repeated firings of the same event and the functions do not return large numbers of file descriptors the user program may not need.

The names come from digital circuits: an action triggered at the moment the level switches between high and low is edge-triggered, while an action triggered for as long as the level is high is level-triggered.

Reference documentation

Brief Analysis of Java NIO — Meituan Technical team

User – mode kernel – mode and multiplexing IO model of the program

IO multiplexing

All that I/O multiplexing stuff

IO multiplexing basic principle full solution

The C10K problem–Dan Kegel’s Web Hostel

Linux SOCKET programming details

Explain IO multiplexing and its three modes — SELECT /poll/epoll

Linux kernel interrupts

Operating System True Restoration – Chapter 7 interrupts

Understand user state and kernel state from root

Socket buffer and blocking mode details

Blocking vs. non-blocking sockets—Scott Klement’s web page

📔 [operating system] I/O multiplexing, select/poll/epoll

Thoroughly understand IO reuse: IO processing killer mace, with your in-depth understanding of SELECT, poll, epoll