Basic Concepts of Linux

User space/kernel space

Modern operating systems use virtual memory, so on a 32-bit operating system the addressable (virtual) space is 4 GB (2^32). The core of the operating system is the kernel, which is independent of ordinary applications: it can access the protected memory space and has full access to the underlying hardware devices. To ensure that user processes cannot operate on the kernel directly and to keep the kernel safe, the operating system divides the virtual address space into two parts: kernel space and user space.

For Linux, the highest 1 GB (virtual addresses 0xC0000000 to 0xFFFFFFFF) is used by the kernel and is called kernel space, while the lower 3 GB (virtual addresses 0x00000000 to 0xBFFFFFFF) is used by user processes and is called user space.

File descriptor

A file descriptor is an abstract concept in computer science used to describe a reference to a file. Formally, a file descriptor is a non-negative integer. In practice, it is an index into the table of open files that the kernel maintains for each process. When a program opens an existing file or creates a new file, the kernel returns a file descriptor to the process. Much low-level programming revolves around file descriptors. The concept, however, usually applies only to UNIX-like operating systems such as Linux.
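
As a quick illustration, here is a minimal sketch in C showing the kernel handing back a file descriptor when a file is opened (the path /etc/hostname is just an arbitrary example):

```c
/* A minimal sketch: the kernel returns a small non-negative integer
 * (the file descriptor) that indexes the per-process open-file table. */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);   /* kernel returns an fd, e.g. 3 */
    if (fd == -1) {
        perror("open");
        return 1;
    }
    char buf[128];
    ssize_t n = read(fd, buf, sizeof(buf));      /* later calls refer to the file by its fd */
    if (n > 0)
        printf("read %zd bytes via fd %d\n", n, fd);
    close(fd);                                   /* releases the slot in the open-file table */
    return 0;
}
```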

Process switching

To control process execution, the kernel must be able to suspend a process running on the CPU and resume execution of a previously suspended process. This behavior is called process switching. Every process therefore runs with the support of the operating system kernel and is closely tied to it, and process switching is expensive. Switching from one running process to another goes through the following steps:

  1. Save the processor context, including the program counter and other registers.
  2. Update the PCB (process control block) information.
  3. Move the process's PCB to the appropriate queue, such as the ready queue or a queue of processes blocked on an event.
  4. Select another process to execute and update its PCB.
  5. Update the memory-management data structures.
  6. Restore the processor context.

Cached I/O

Cached I/O, also known as standard I/O, is the default I/O mechanism of most file systems. In the Linux cached I/O mechanism, the operating system caches I/O data in the file system's page cache: data is first copied into a buffer in the kernel, and then copied from that kernel buffer into the application's address space.

Disadvantage of cached I/O: during transmission, data is copied multiple times between the application address space and the kernel, and these copy operations carry significant CPU and memory overhead.
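
A minimal sketch of what a standard (cached) read looks like from user space; the file name data.txt is only an example, and the two copies happen inside the kernel and the read() call:

```c
/* A minimal sketch of standard (cached) I/O: read() goes through the kernel
 * page cache before the data lands in the user-space buffer. */
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.txt", O_RDONLY);      /* "data.txt" is just an example path */
    if (fd == -1)
        return 1;
    char buf[4096];
    /* Copy 1: disk -> kernel page cache (done by the kernel, possibly earlier).
     * Copy 2: kernel page cache -> buf in the application's address space.   */
    ssize_t n = read(fd, buf, sizeof(buf));
    (void)n;
    close(fd);
    return 0;
}
```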

Process blocking

If an event the running process expects has not yet occurred, for example a system-resource request fails, it is waiting for an operation to complete, new data has not yet arrived, or there is no new work to do, the system automatically executes the blocking primitive and changes the process from the running state to the blocked state. Blocking is therefore an active behavior of the process itself, so only a running process (one that holds the CPU) can enter the blocked state. A blocked process consumes no CPU resources.

Five IO models

  • [1] Blocking IO
  • [2] Non-blocking IO
  • [3] IO multiplexing
  • [4] Signal-driven IO
  • [5] Asynchronous IO

The first four models can be categorized as synchronous IO. Signal-driven IO is rarely used in practice, so it is not introduced here.

Blocking IO

In blocking IO, the user process waits during both phases of the recvfrom operation. While the data is not ready, the process waits for the kernel to prepare it; once the kernel has the data ready, the process continues to wait while the kernel copies the data into its buffer. The process does not return from the recvfrom system call until the kernel has finished the copy.
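
A minimal sketch of blocking IO on a UDP socket, using only the standard socket API (port 9000 is an arbitrary example and error handling is kept minimal):

```c
/* A minimal sketch of blocking IO: recvfrom() does not return until the
 * kernel has data ready AND has copied it into buf. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);              /* 9000 is an arbitrary example port */
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    char buf[1024];
    /* Phase 1: wait for a datagram to arrive.
     * Phase 2: wait for the kernel to copy it into buf.
     * The process blocks through both phases. */
    ssize_t n = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);
    printf("got %zd bytes\n", n);
    close(sock);
    return 0;
}
```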

Non-blocking IO (still synchronous IO)

In non-blocking IO, the process does not block during the first phase of recvfrom: if the kernel data is not ready, recvfrom immediately returns an EWOULDBLOCK error. Once the kernel has prepared the data and the second phase begins, the process waits for the kernel to copy the data into its buffer, and it returns from recvfrom only after the copy is complete.
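
A minimal sketch of that pattern; the helper name nonblocking_read and the 1 ms back-off are illustrative choices, not part of any standard API:

```c
/* A minimal sketch of non-blocking IO: with O_NONBLOCK set, recvfrom()
 * returns -1 with EWOULDBLOCK/EAGAIN immediately when no data is ready,
 * so the caller typically retries in a loop (busy polling). */
#include <sys/socket.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

ssize_t nonblocking_read(int sock, char *buf, size_t len) {
    int flags = fcntl(sock, F_GETFL, 0);
    fcntl(sock, F_SETFL, flags | O_NONBLOCK);   /* switch the socket to non-blocking mode */

    for (;;) {
        ssize_t n = recvfrom(sock, buf, len, 0, NULL, NULL);
        if (n >= 0)
            return n;                            /* phase 2 (the copy) still happened inside this call */
        if (errno == EWOULDBLOCK || errno == EAGAIN) {
            usleep(1000);                        /* data not ready yet: back off and try again */
            continue;
        }
        return -1;                               /* a real error */
    }
}
```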

IO multiplexing

IO multiplexing covers the select, poll, and epoll models. The process still blocks and waits in both processing phases, but select/poll/epoll have the advantage of being able to monitor many IO streams at low cost.
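
A minimal sketch of the idea using select(); sock1 and sock2 are assumed to be sockets the caller has already created:

```c
/* A minimal sketch of IO multiplexing: the process blocks in select() until
 * one of the monitored sockets is readable, then blocks again briefly while
 * recvfrom() copies the data into the user buffer. */
#include <sys/select.h>
#include <sys/socket.h>

ssize_t wait_and_read(int sock1, int sock2, char *buf, size_t len) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(sock1, &readfds);
    FD_SET(sock2, &readfds);                     /* one select() call watches multiple fds */

    int maxfd = (sock1 > sock2 ? sock1 : sock2);
    if (select(maxfd + 1, &readfds, NULL, NULL, NULL) <= 0)
        return -1;                               /* phase 1: block until some fd is readable */

    int ready = FD_ISSET(sock1, &readfds) ? sock1 : sock2;
    return recvfrom(ready, buf, len, 0, NULL, NULL);  /* phase 2: copy the data */
}
```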

Asynchronous IO

Asynchronous IO requires that the process not wait in either phase of the recvfrom operation: the process calls recvfrom and returns immediately, the kernel prepares the data on its own and copies it from the kernel buffer into the process's buffer, and then notifies the process that the read operation is complete, at which point the process handles the data. Unfortunately, classic Linux network IO offers no such asynchronous model; the second phase of Linux network IO always blocks waiting for the data copy to complete. A true asynchronous network IO model is IOCP (IO completion ports) on Windows.

Linux socket event wakeup callback mechanism

[1] The block-and-wait logic of select, poll, and epoll_wait

  1. Select, poll, and epoll_wait trap into the kernel and check whether any monitored socket has an event of interest. If not, a wait_entry node is constructed for the current process and inserted into the sleep_list of each monitored socket.
  2. The process then loops in schedule() until an event of interest occurs.
  3. Once an event of interest has occurred, the current process's wait_entry node is removed from the socket's sleep_list.

[2] Wake up logic

  1. When an event occurs on the socket, the socket traverses its sleep queue in order, calling the callback function of each wait_entry node in turn.

  2. The traversal stops when the whole queue has been visited or when an exclusive wait_entry node is encountered.

  3. In general, the callback contains two pieces of logic:

    1. The wait_entry's custom private logic;
    2. The common wake-up logic, which mainly puts the wait_entry's process onto the CPU's ready queue so that the CPU can schedule its execution later.
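
The following is a simplified, illustrative sketch of that traversal in C. It is not the actual kernel code; the struct layout and names are invented for the illustration:

```c
/* A simplified sketch (not real kernel code) of the wake-up traversal:
 * when an event occurs on a socket, the kernel walks the socket's sleep
 * list and invokes each wait_entry callback until the list ends or an
 * exclusive waiter is reached. */
#include <stddef.h>

struct wait_entry {
    struct wait_entry *next;
    int exclusive;                        /* stop the traversal after waking this waiter */
    int (*callback)(struct wait_entry *); /* private logic + common wake-up logic */
};

static void socket_wake_up(struct wait_entry *sleep_list) {
    for (struct wait_entry *e = sleep_list; e != NULL; e = e->next) {
        e->callback(e);                   /* 1. custom private logic, 2. put process on ready queue */
        if (e->exclusive)
            break;                        /* an exclusive waiter ends the traversal early */
    }
}
```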

The select logic

1. fds: the set of file descriptors being monitored.

2. Select leaves three problems:

(1) The monitored fds set is limited to 1024 entries, which is too small; we would like a much larger monitored set.

(2) The fds set must be copied from user space to kernel space on every call; we would like to avoid this copy.

(3) When some of the monitored fds have readable data, we want the notification to be more precise: we would like to get back a list of only the fds with readable events, instead of traversing the whole fds set to collect them.

Poll logic:

Poll changes how the fds set is described: it uses a pollfd array instead of select's fd_set structure, so the number of fds poll can monitor is no longer limited to select's 1024.
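
A minimal sketch of poll() with a single pollfd entry; a real caller would pass an array sized to however many fds it wants to monitor:

```c
/* A minimal sketch of poll(): struct pollfd replaces select()'s fixed-size
 * fd_set, so the monitored set is limited only by what the caller allocates. */
#include <poll.h>
#include <sys/socket.h>

ssize_t poll_and_read(int sock, char *buf, size_t len) {
    struct pollfd pfd;
    pfd.fd = sock;
    pfd.events = POLLIN;                       /* interested in readable events */

    if (poll(&pfd, 1, -1) <= 0)                /* timeout -1: wait indefinitely */
        return -1;
    if (pfd.revents & POLLIN)
        return recvfrom(sock, buf, len, 0, NULL, NULL);
    return -1;
}
```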

Epoll logic:

In the computer industry, there are two kinds of problem-solving ideas:

[1] Any problem in computer science can be solved by adding an intermediate layer

[2] Change from centralized (central) processing to decentralized (distributed) processing

Let’s see how epoll applies these two ideas to solve the legacy problems (2) and (3) of select.

Copying the fds set

For IO multiplexing, two things must be done (taking the monitoring of readable events as the example):

  1. Prepare the set of fds that need to be monitored;
  2. Detect and return which fds in that set are readable.

Looking closely at the function prototypes of select and poll, we see that every call repeatedly prepares (in one centralized step) the entire set of fds that needs to be monitored. Yet for select or poll, which are called frequently, the fds set itself changes rarely, so there is no need to re-prepare (centralize) the whole set on every call.

Epoll introduces the epoll_ctl system call, separating the high-frequency epoll_wait call from the low-frequency epoll_ctl calls. epoll_ctl uses three operations (EPOLL_CTL_ADD, EPOLL_CTL_MOD, and EPOLL_CTL_DEL) to spread out modifications to the monitored fds set. This turns select/poll's high-frequency, large-block memory copy (centralized processing) into epoll_ctl's low-frequency, small-block memory copies (decentralized processing), avoiding a large amount of copying.
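
A minimal sketch of how these calls are typically combined, assuming listen_sock is an already-listening TCP socket; error handling is mostly omitted:

```c
/* A minimal sketch of the epoll split: the fd set is built once (or changed
 * incrementally) with low-frequency epoll_ctl calls, while the high-frequency
 * epoll_wait call no longer copies the whole set. */
#include <sys/epoll.h>
#include <sys/socket.h>

int run_epoll_loop(int listen_sock) {
    int epfd = epoll_create1(0);               /* the size hint of epoll_create() is obsolete */
    if (epfd == -1)
        return -1;

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = listen_sock;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_sock, &ev);   /* add to the monitored set once */

    struct epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);        /* only ready fds are returned */
        if (n < 0)
            return -1;
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_sock) {
                int conn = accept(listen_sock, NULL, NULL);
                ev.events = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);  /* incremental EPOLL_CTL_ADD */
            } else {
                char buf[1024];
                recv(events[i].data.fd, buf, sizeof(buf), 0);
            }
        }
    }
}
```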

In addition, because epoll uses epoll_ctl to add, delete, and modify the monitored fds set, fast lookup of an fd becomes necessary, so the set must be organized in a data structure with low time complexity for insert, delete, modify, and search. Before Linux 2.6.8, epoll used a hash table to organize the fds set, so it needed an initial hash size when creating an epoll fd; this is why epoll_create(int size) takes a size argument, which lets the kernel allocate the hash accordingly. In kernels after 2.6.8, epoll uses a red-black tree to organize the monitored fds set, so the size parameter of epoll_create(int size) is effectively meaningless.

Traversing the ready fds set as needed

Epoll introduces an intermediate layer: a doubly linked list (ready_list) plus a separate sleep queue (single_epoll_wait_list). When an event occurs on a monitored socket, its callback appends that fd to ready_list instead of waking every waiter, so epoll_wait only has to walk ready_list rather than the whole monitored set.

ET (Edge Triggered) vs LT (Level Triggered)

It is necessary to talk about the two trigger modes of epoll events. The basic concepts of the two modes are as follows:

Edge Triggered (ET)
  • A read event is triggered when the state of the socket's receive buffer changes, that is, when an empty receive buffer has just received data.

  • A write event is triggered when the state of the socket's send buffer changes, that is, when a full send buffer frees up space and becomes writable again.

Events are triggered only when the buffer state changes, for example when the receive buffer goes from empty to non-empty (unreadable to readable).

Level Triggered (LT)
  • As long as the socket's receive buffer is not empty and there is data to read, the read event keeps being triggered.

  • As long as the socket's send buffer is not full and can accept more data, the write event keeps being triggered.

In LT mode, the events returned by epoll_wait reflect the current state of the socket.
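
A minimal sketch of how ET mode is typically used: register with EPOLLET on a non-blocking socket and drain the buffer until EAGAIN (epfd and sock are assumed to come from code like the epoll sketch above):

```c
/* A minimal sketch of edge-triggered use: with EPOLLET the event is reported
 * only on the empty -> non-empty transition, so the fd should be non-blocking
 * and must be read until EAGAIN to drain the receive buffer. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <errno.h>

void add_edge_triggered(int epfd, int sock) {
    fcntl(sock, F_SETFL, fcntl(sock, F_GETFL, 0) | O_NONBLOCK);

    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLET;             /* ET: notify only on state change */
    ev.data.fd = sock;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);
}

void drain(int sock) {
    char buf[4096];
    for (;;) {                                  /* keep reading until the buffer is empty */
        ssize_t n = recv(sock, buf, sizeof(buf), 0);
        if (n > 0)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;                              /* drained: wait for the next edge */
        break;                                  /* peer closed or real error */
    }
}
```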