1 Introduction

epoll is an old topic and a classic required course for back-end engineers. Because so many people have studied it, the material on it keeps getting deeper; but just as many people write about it, and since the quality of those write-ups varies widely, plenty of misunderstandings circulate as well.

I’m sharing epoll again today, and I promise I won’t just produce yet another table comparing the differences, which would be boring. Instead, I will explain the principles behind epoll in depth, starting from thread blocking, interrupt optimization, and the way data travels from the network card. Finally, I will also diss a few popular views. Whether you are already familiar with epoll or not, this article should be of value to you.

2 Opening Questions

Before we begin, let me ask you a few questions.

1. How high is epoll’s performance? Many articles describe how epoll can easily handle hundreds of thousands of connections, and claim its performance is thousands of times that of traditional IO, which can handle only a few hundred connections.

2. Many articles divide network IO into blocking, non-blocking, synchronous and asynchronous, and claim that non-blocking performs better than blocking and asynchronous performs better than synchronous.

  • If blocking causes poor performance, why does traditional IO block?
  • Does epoll itself need to block?
  • Java NIO and AIO are both implemented on top of epoll. How should we understand the difference between synchronous and asynchronous?

3. About IO multiplexing:

  • Why did select, poll and epoll appear one after another?
  • Why does epoll perform better than select?

PS:

This article consists of three parts: a first look at epoll, the principles behind epoll, and a diss section.

The focus of this article is on the principles, so readers are advised to keep asking themselves “why” as they read.

The difference between a process and a thread on Linux isn’t that great, especially when discussing principles and performance issues, so the terms “process” and “thread” are used interchangeably in this article.

3 A First Look at epoll

epoll is a scalable I/O event notification mechanism of the Linux kernel, known for its excellent performance. Below is libevent’s performance benchmark of select, poll, epoll, and kqueue.



Many articles have referenced this benchmark when describing epoll performance, but few have clearly explained the results.

This benchmark limits the test to 100 active connections, each performing 1000 read/write operations. The vertical axis is the response time of a request, and the horizontal axis is the number of socket handles held. As the number of handles grows, the response times of epoll and kqueue barely change, while those of poll and select increase significantly.

As you can see, epoll’s performance is very high, and its advantage becomes more obvious as the number of monitored file descriptors grows.

However, the limit of 100 active connections is important. epoll performs well with a huge number of connections only when few of them are active at any moment; in other words, epoll shines when handling large numbers of mostly idle connections. If all 15,000 sockets were active, epoll and select would not differ much.

Why does epoll’s high performance have such limitations?

The list of questions keeps growing. It looks like we need to dig deeper.

4 Principles behind epoll

4.1 Blocking

4.1.1 Why Block?

Taking a network card receiving data as the example, let’s review the data-receiving process I shared earlier.



For the sake of understanding, I’ve tried to simplify the technical details as much as possible by dividing the process of receiving data into four steps:

  1. The NIC receives the data and writes it into memory (the ring buffer and sk_buff) via DMA.
  2. The NIC issues an interrupt request (IRQ) to tell the kernel that new data has arrived.
  3. The Linux kernel responds to the interrupt, switches to kernel mode, runs the interrupt handler, takes a packet from the ring buffer, runs it through the protocol stack, fills the socket, and hands the data over to the user process.
  4. The system switches back to user mode, and the user process processes the data.

When the network card receives data depends on the sender and the transmission path, and the latency is usually high, on the order of milliseconds (ms). Applications, by contrast, process data at the nanosecond (ns) scale. In other words, across the whole flow, the kernel waiting for data and running the protocol stack is a relatively slow process, and the user-mode process has nothing to do during that time. Hence the use of “blocking”.
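To make the “blocking” step concrete, here is a minimal sketch (my own illustration, not code from the original article) of the classic blocking read path: the calls to accept() and recv() simply put the process to sleep until the kernel has data for this socket. The port number is an arbitrary choice and error handling is kept to the bare minimum.

```c
/* Minimal BIO sketch: the process sleeps inside accept()/recv() and
 * consumes no CPU until the kernel wakes it with data.
 * Assumptions: port 8080 is free; single connection; minimal error handling. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0) { perror("socket"); exit(1); }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); exit(1); }
    if (listen(listen_fd, 16) < 0) { perror("listen"); exit(1); }

    /* Blocks: the process leaves the work queue until a connection arrives. */
    int conn_fd = accept(listen_fd, NULL, NULL);
    if (conn_fd < 0) { perror("accept"); exit(1); }

    char buf[4096];
    /* Blocks again: the kernel wakes this process only after the NIC has
     * delivered data and the protocol stack has filled the socket. */
    ssize_t n = recv(conn_fd, buf, sizeof(buf), 0);
    if (n > 0)
        printf("received %zd bytes\n", n);

    close(conn_fd);
    close(listen_fd);
    return 0;
}
```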

4.1.2 Blocking Does Not Consume CPU

Blocking is a key part of process scheduling. It refers to the waiting state a process enters until some event occurs. Take a look at the following table: in Linux there are roughly seven process states (include/linux/sched.h defines more):



As the descriptions show, the “runnable” state consumes CPU resources, as does creating and destroying processes (work done by the kernel). The key point is that a “blocked/suspended” process does not consume CPU resources.

Put another way: to support multitasking, Linux implements process scheduling (CPU time-slice scheduling). Time slices are switched only among runnable processes, which is why “blocked/suspended” processes consume no CPU.

In addition, to make time-slice scheduling easier, all processes in the runnable state are linked into a queue, called the **“work queue”** (what the kernel calls the run queue).

4.1.3 Recovering from Blocking

Of course, the kernel can easily change a process’s state. The question is: in network IO, which process should the kernel wake up?



The socket structure contains two important pieces of data: the process ID and the port number. The process ID records the process suspended in connect, send, or read. The port number is fixed when the socket is created, and the operating system maintains a data structure mapping port numbers to sockets.

When the NIC receives data, the packet carries a port number; with it, the kernel can find the socket, obtain the ID of the “suspended” process, and change that process’s state to runnable (add it back to the work queue). At that point the kernel code finishes and control returns to user mode, and the user process handles the data through normal CPU time-slice scheduling.

4.1.4 Process model

The whole flow described above is essentially the basic principle of BIO (blocking IO): each user process handles its own connection independently, which is in effect a process-per-connection model.

4.2 Optimization of context switching

There are two places in the flow described above that cause frequent context switches and hurt efficiency:

  1. If packets arrive frequently, the NIC may issue frequent interrupt requests (IRQs). The CPU may be in user mode, in kernel mode, or still working through the protocol stack for the previous packet, yet it must respond to each interrupt as quickly as possible. This is quite inefficient: it causes a lot of context switching and may leave the user process starved of data for a long time. (Even on a multi-core machine, if the protocol stack never gets to finish, the data can never be handed to the user process.)
  2. Each packet corresponds to a socket, and each socket corresponds to a user-mode process. When these user-mode processes become runnable, context switches between them are unavoidable.

4.2.1 NAPI mechanism of the NIC driver

On the NIC driver side, the technique for addressing frequent IRQs is called the New API (NAPI). The principle is simple: split the interrupt handler into two halves.

  1. The napi_schedule function responds to the IRQ quickly, records only the necessary information, and raises a soft interrupt (softirq) at an appropriate time.
  2. The net_rx_action function, executed separately in softirq context, responds to the soft interrupt raised by napi_schedule and processes the data in the ring buffer in batches.

Therefore, using the NAPI driver, the process of receiving data can be simply described as:

  1. The NIC receives the data and writes it into memory (the ring buffer and sk_buff) via DMA.
  2. The NIC issues an interrupt request (IRQ) to tell the kernel that new data has arrived.
  3. The driver’s napi_schedule function responds to the IRQ and raises a soft interrupt (NET_RX_SOFTIRQ) at an appropriate time.
  4. The driver’s net_rx_action function responds to the soft interrupt, pulls the received data from the ring buffer in batches, runs the protocol stack, fills the sockets, and hands the data over to the user processes.
  5. The system switches to user mode. Multiple user processes become runnable and process the data as they are scheduled by CPU time slices.

In a nutshell: wait for a batch of data to arrive, then process the whole batch at once.

4.2.2 Single-threaded I/O multiplexing

The kernel’s optimization for “inter-process context switching” is called “IO multiplexing”, and its idea is very similar to NAPI’s.

Each socket no longer blocks the process that reads or writes it; instead, a dedicated thread processes the user-mode data for many sockets in batches, thereby reducing context switching between threads.



As an implementation of IO multiplexing, the principle of select is very simple: every socket stores the ID of the process that called select. Whenever any socket receives data, it wakes up that monitor process. The kernel simply tells the monitor which sockets are ready, and the monitor processes them in batches.
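A minimal sketch of such a single-threaded select monitor (my own illustration, assuming the sockets are already connected and stored in an array; handle_data is a stand-in for the application’s per-socket processing):

```c
/* Sketch of a single-threaded select() monitor: one thread blocks on behalf
 * of all sockets and then batch-processes the ready ones.
 * Assumption: fds[] holds nfds already-connected sockets.
 * (A full implementation would also remove closed fds from fds[].) */
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static void handle_data(int fd) {           /* stand-in for real processing */
    char buf[4096];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    if (n <= 0) close(fd);                  /* peer closed or error */
}

void monitor_loop(int *fds, int nfds) {
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);                  /* the fd_set must be rebuilt on every iteration */
        int maxfd = -1;
        for (int i = 0; i < nfds; i++) {
            FD_SET(fds[i], &readfds);
            if (fds[i] > maxfd) maxfd = fds[i];
        }

        /* One blocking call covers every socket; the kernel wakes this
         * monitor thread as soon as any of them becomes readable. */
        int ready = select(maxfd + 1, &readfds, NULL, NULL, NULL);
        if (ready <= 0) continue;

        /* O(n): scan all sockets to discover which ones are ready. */
        for (int i = 0; i < nfds; i++)
            if (FD_ISSET(fds[i], &readfds))
                handle_data(fds[i]);
    }
}
```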

4.3 Evolution of IO multiplexing

4.3.1 Comparing epoll and select

select, poll, and epoll are all “IO multiplexing”, so where does the performance gap come from? Space is limited, so let’s briefly compare just the fundamental differences between select and epoll.

From the kernel’s point of view, many sockets may be handled at the same time, and there may be several monitoring processes. So each time a monitoring process “batch-processes data”, it has to tell the kernel which sockets it cares about; and when the kernel wakes the monitoring process, it has to pass back the ready sockets among those of interest.

In other words, when select or the epoll system calls (epoll_create, epoll_ctl, epoll_wait) are executed, the input is the “sockets of interest” and the output is the “ready sockets”.

The differences between select and epoll:

select (one O(n) scan)

  1. A user-space fd_set is allocated and passed to the kernel on every call to represent the “sockets of interest”. Its structure (essentially a bitmap) limits it to 1024 sockets.
  2. Each time a socket’s state changes, the kernel checks the fd_set in O(1) to see whether the monitoring process cares about that socket.
  3. The kernel reuses the fd_set as the output parameter and returns it to the monitoring process (which is why the input has to be reset before every select call).

However, the monitoring process must then traverse the whole socket array, O(n), to find out which sockets are ready.

epoll (no O(n) scans)

  1. A single epoll instance handle is passed to the kernel each time. The handle refers to kernel-side data structures: a red-black tree (rbr) plus a doubly linked ready list (rdllist). As long as the handle stays the same, the kernel can reuse the results of its previous work.
  2. Each time a socket’s state changes, the kernel can quickly check the red-black tree (an O(log n) lookup) to see whether the monitoring process cares about that socket, and it updates rdllist accordingly, so rdllist effectively acts as a cache of ready sockets.
  3. The kernel copies some or all of rdllist (depending on LT or ET mode) into the epoll_event array that serves as the output parameter.

So the monitoring process can process the ready sockets directly, one by one, without traversing everything to check.
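For comparison, here is the same monitor written against the epoll API (again my own sketch, with the same assumptions as the select sketch): each socket is registered once, and every epoll_wait call returns only the ready sockets, so there is no full scan.

```c
/* Sketch of the same monitor using epoll: sockets of interest are registered
 * once, and epoll_wait returns only the ready ones.
 * Assumption: fds[] holds nfds already-connected sockets; handle_data() is a
 * stand-in, as in the select sketch. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static void handle_data(int fd) {            /* stand-in for real processing */
    char buf[4096];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    if (n <= 0) close(fd);                   /* closing the fd also removes it from epoll */
}

void epoll_monitor_loop(int *fds, int nfds) {
    int epfd = epoll_create1(0);             /* the instance handle (rbr + rdllist live in the kernel) */

    for (int i = 0; i < nfds; i++) {         /* register each socket of interest exactly once */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);
    }

    struct epoll_event events[64];
    for (;;) {
        /* Blocks until at least one registered socket is ready;
         * only the ready sockets are copied back to user space. */
        int ready = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < ready; i++)      /* iterate ready sockets only, no O(n) scan */
            handle_data(events[i].data.fd);
    }
}
```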



Also, it doesn’t actually matter much whether epoll’s underlying implementation is a red-black tree (a hash table could have been used instead). What matters is that the epfd returned by epoll_create is a handle pointing into kernel data structures, which the kernel can transparently swap for any other data structure.

4.3.2 Timeline of API release

In addition, let’s take a look at the release timeline of the various network IO APIs. Two interesting conclusions emerge.

  • 1983: socket was released in Unix (4.2BSD)
  • 1983: select was released in Unix (4.2BSD)
  • 1994: Linux 1.0 already supported socket and select
  • 1997: poll was released in Linux 2.1.23
  • 2002: epoll was released in Linux 2.5.44

socket and select were released at the same time. This shows that select was not meant to replace traditional IO; they are two different usages (or models) suited to different scenarios.

select, poll, and epoll, the three “IO multiplexing” APIs, were released one after another. This shows that they are three evolutionary versions of IO multiplexing: because of API design flaws, the internal logic could not be optimized without changing the API, so poll replaced select, and epoll replaced poll.

4.4 Summary

We’ve spent three sections explaining the principles behind epoll, and they can be summed up in three sentences.

  1. Given the basic way data is sent and received, the system uses blocking to improve CPU utilization.
  2. “IO multiplexing” (like NAPI) was designed to optimize context switching.
  3. The three generations of the API (select, poll, and epoll) were designed to optimize the interaction between the kernel and the monitoring process.

5 The Diss Section

With “the principles behind epoll” covered, we are ready to answer the opening questions. The article is already complete at this point, and many people have urged me to delete the diss section below.

My view is that learning is a process of research plus understanding. The above was the research; what follows is my personal “understanding”, and corrections are welcome.

5.1 Classification of IO Models

The classification into blocking, non-blocking, synchronous, and asynchronous makes some sense. But from an operating-system perspective, **this classification is misleading and not very helpful**.

5.1.1 Blocking and non-blocking

Because of the basic way data is sent and received, all IO models under Linux involve blocking; blocking the user thread is simply an efficient way to wait.

You could, of course, write a program that sets the socket to non-blocking mode and executes IO operations in an infinite loop without any monitoring call. But this is so inefficient that it makes no sense in practice.
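To make that concrete, here is a sketch (my own illustration) of such a busy-polling loop: the socket is switched to non-blocking mode with fcntl, and the process spins on recv() returning EAGAIN, staying runnable and burning CPU the whole time.

```c
/* Non-blocking busy polling: it works, but the loop keeps the process
 * runnable and wastes CPU whenever no data has arrived.
 * Assumption: conn_fd is an already-connected TCP socket. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

void busy_poll(int conn_fd) {
    /* switch the socket to non-blocking mode */
    int flags = fcntl(conn_fd, F_GETFL, 0);
    fcntl(conn_fd, F_SETFL, flags | O_NONBLOCK);

    char buf[4096];
    for (;;) {
        ssize_t n = recv(conn_fd, buf, sizeof(buf), 0);
        if (n > 0) {
            printf("got %zd bytes\n", n);    /* data finally arrived */
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            continue;                        /* nothing yet: spin and retry, pure CPU waste */
        } else {
            break;                           /* peer closed the connection or a real error */
        }
    }
}
```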

In other words, blocking is not the problem; running is the problem, because running consumes CPU. IO multiplexing does not reduce blocking, it reduces running. Context switching is the real cost, and IO multiplexing reduces it by reducing the number of running processes.

5.1.2 Synchronous and Asynchronous

All IO models under Linux are synchronous: BIO is synchronous, select is synchronous, poll is synchronous, and epoll is still synchronous.

Java provides an AIO that might be called “asynchronous”. But the JVM runs in user mode, and Linux provides no true asynchronous support here, so the asynchronous support offered by the JVM is not fundamentally different from an “asynchronous” framework you wrap yourself (which you can easily build on top of BIO).

Synchronous and asynchronous here are really just two kinds of event dispatchers, or two design patterns: Reactor and Proactor. Since both run in user mode, how much performance difference can there be between the two?

  • Reactor corresponds to Java’s NIO, whose core APIs are Channel, Buffer and Selector.
  • Proactor corresponds to Java’s AIO, whose core APIs are the asynchronous channel components together with Future or Callback.

5.1.3 My classification

I think there are only two types of IO models:

  1. The process model, which programmers find easy to understand and use;
  2. The IO multiplexing model, which better matches how the operating system actually handles IO.

For the event dispatching of “IO multiplexing”, there are two further categories: Reactor and Proactor.

5.2 About mmap

Does epoll use mmap at all?

Answer: No!

This is a false rumour, and it is actually quite easy to disprove: write a small demo with epoll and run it under strace; the trace makes it clear.
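A sketch of such a check (my own demo, not the article’s): a trivial program that only exercises the epoll system calls, which you can run under strace and then inspect which syscalls are actually issued around epoll_create1, epoll_ctl and epoll_wait.

```c
/* Tiny epoll demo for inspecting syscalls.
 * Build: gcc -o epoll_demo epoll_demo.c
 * Run:   strace ./epoll_demo
 * Then look at the trace around epoll_create1 / epoll_ctl / epoll_wait. */
#include <sys/epoll.h>
#include <unistd.h>

int main(void) {
    int epfd = epoll_create1(0);

    /* watch stdin (fd 0) for readability, just to have something registered */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = 0 };
    epoll_ctl(epfd, EPOLL_CTL_ADD, 0, &ev);

    struct epoll_event out;
    epoll_wait(epfd, &out, 1, 100);   /* wait at most 100 ms */

    close(epfd);
    return 0;
}
```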