
Note: this article is fairly long; you may want to bookmark it and come back to it.

Today we take an in-depth look at how Linux receives network packets. Let's start with a simple piece of UDP socket code:

int main()
{
    int serverSocketFd = socket(AF_INET, SOCK_DGRAM, 0);
    bind(serverSocketFd, ...);

    char buff[BUFFSIZE];
    int readCount = recvfrom(serverSocketFd, buff, BUFFSIZE, 0, ...);
    buff[readCount] = '\0';
    printf("Receive from client:%s\n", buff);
}

The code above is the receive logic of a UDP server. Seen from the developer's perspective, as soon as the client sends data, the server receives it via recvfrom and prints it. What we want to know is: what happens between the moment a network packet reaches the NIC and the moment our recvfrom returns the data?

In this article, you will gain an in-depth understanding of how the Linux networking subsystem is implemented internally and how its parts interact with each other. I believe this will be of great help to your work. This article is based on Linux 3.10; for the source code see mirrors.edge.kernel.org/pub/linux/k…

1. Overview of Linux packet receiving

In the layered TCP/IP network model, the protocol stack is divided into the physical layer, link layer, network layer, transport layer and application layer. The physical layer corresponds to the network card and the cable; the application layer corresponds to familiar applications such as Nginx and FTP. Linux implements the link layer, the network layer and the transport layer.

In the Linux kernel, the link layer is implemented by the NIC driver, while the network and transport layers are implemented by the kernel protocol stack. The kernel exposes the socket interface to the application layer for user processes to use. The layered TCP/IP model seen from a Linux perspective looks like this:

Figure 1 The network protocol stack from a Linux perspective

In the Linux source tree, the logic for network device drivers lives under drivers/net/ethernet, and the Intel series NIC drivers are in drivers/net/ethernet/intel. The protocol stack code is located in the kernel and net directories.

The kernel and the network device driver communicate via interrupts. When data arrives on the device, a voltage change is triggered on the relevant CPU pin, notifying the CPU to process the data. For the network module this processing is complicated and time-consuming; if all of it were done in the interrupt handler, the handler (which runs at high priority) would occupy the CPU excessively and the CPU would be unable to respond to other devices such as the mouse and keyboard. For this reason Linux splits interrupt handling into a top half and a bottom half. The top half does only the simplest work, finishes quickly and releases the CPU, so that new interrupts can come in. Most of the remaining work is deferred to the bottom half, where it can be handled at leisure. Later kernel versions implement the bottom half as soft interrupts (softirqs), which are handled by the ksoftirqd kernel threads. Unlike hard interrupts, which apply voltage changes to physical CPU pins, a soft interrupt notifies the softirq handler by setting a bit in a variable in memory.

Now that we have introduced the NIC driver, hard interrupts, soft interrupts and the ksoftirqd threads, we can sketch the kernel's packet receiving path based on these concepts:

When data arrives at the network card, the first module to work in Linux is the network driver. The driver writes the frames received by the NIC into memory via DMA, then raises an interrupt to notify the CPU that data has arrived. When the CPU receives the interrupt request, it calls the interrupt handler registered by the network driver. The NIC's interrupt handler does very little: it raises a soft interrupt request and releases the CPU as quickly as possible. ksoftirqd notices the soft interrupt request and calls poll to poll for received packets; after receiving them, it hands them to the protocol stack for processing. UDP packets end up in the receive queue of the user's socket.

We’ve got an overview of how Linux handles packets from the diagram above. But for more details on how the network module works, we have to look further.

2. Linux startup

Linux drivers, the kernel protocol stack and other modules need to do a lot of preparation before they can receive packets from the network adapter: the ksoftirqd kernel threads must be created in advance, the handler functions of each protocol must be registered, the network device subsystem must be initialized, and the network card must be started. Only when all of this is ready can packets actually be received. Let's look at how these preparations are done.

2.1 Creating the ksoftirqd Kernel Threads

Linux soft interrupts are handled in dedicated kernel threads (ksoftirqd), so it is worth looking at how these threads are initialized; this will help us understand the packet receiving process more accurately later. The number of threads is not 1 but N, where N equals the number of cores on your machine.

During system initialization, spawn_ksoftirqd (located in kernel/softirq.c) runs and calls smpboot_register_percpu_thread in kernel/smpboot.c, which creates the ksoftirqd threads.

Figure 3 Creating the ksoftirqd kernel threads

The relevant codes are as follows:

//file: kernel/softirq.c

static struct smp_hotplug_thread softirq_threads = {

    .store          = &ksoftirqd,
    .thread_should_run  = ksoftirqd_should_run,
    .thread_fn      = run_ksoftirqd,
    .thread_comm        = "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
    register_cpu_notifier(&cpu_nfb);

    BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
    return 0;

}

early_initcall(spawn_ksoftirqd);

Once ksoftirqd is created, it enters its thread loop, calling ksoftirqd_should_run and run_ksoftirqd to continuously check whether any soft interrupts need handling. Note that soft interrupts are not only for networking; there are other types as well:

//file: include/linux/interrupt.h
enum{
    HI_SOFTIRQ=0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,  
};

2.2 Network Subsystem Initialization

Figure 4 Network subsystem initialization

The Linux kernel initializes each subsystem by calling subsys_initcall; you can grep the source tree for many invocations of this function. Here we are concerned with the initialization of the network subsystem, which ends up executing the net_dev_init function.

//file: net/core/dev.c

static int __init net_dev_init(void)
{
    ......
    for_each_possible_cpu(i) {
        struct softnet_data *sd = &per_cpu(softnet_data, i);

        memset(sd, 0, sizeof(*sd));
        skb_queue_head_init(&sd->input_pkt_queue);
        skb_queue_head_init(&sd->process_queue);
        sd->completion_queue = NULL;
        INIT_LIST_HEAD(&sd->poll_list);
        ......
    }
    ......
    open_softirq(NET_TX_SOFTIRQ, net_tx_action);
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);
}

subsys_initcall(net_dev_init);

In this function, a softnet_data structure is allocated for each CPU. The poll_list inside it is where drivers will register their poll functions; we will see this later when the NIC driver is initialized.

open_softirq registers a handler for each soft interrupt type: net_tx_action for NET_TX_SOFTIRQ and net_rx_action for NET_RX_SOFTIRQ. Following open_softirq, we find that the registration is recorded in the softirq_vec variable. Later, when the ksoftirqd thread receives a soft interrupt, it uses this variable to find the corresponding handler for each softirq type.

//file: kernel/softirq.c

void open_softirq(int nr, void (*action)(struct softirq_action *)){

    softirq_vec[nr].action = action;

}

2.3 Protocol Stack Registration

The kernel implements the IP protocol at the network layer, and the TCP and UDP protocols at the transport layer. The corresponding implementation functions are ip_rcv(), tcp_v4_rcv() and udp_rcv(). Unlike the way we normally write code, the kernel wires these up through registration. Similar to subsys_initcall, fs_initcall is another initialization entry point in the Linux kernel; fs_initcall calls inet_init, which registers the network protocol stack. Via inet_init, these functions are registered into the inet_protos and ptype_base data structures, as shown below:

Figure 5 AF_INET protocol stack registration

The relevant code is as follows:

//file: net/ipv4/af_inet.c

static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
};

static const struct net_protocol udp_protocol = {
    .handler     = udp_rcv,
    .err_handler = udp_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};

static const struct net_protocol tcp_protocol = {
    .early_demux = tcp_v4_early_demux,
    .handler     = tcp_v4_rcv,
    .err_handler = tcp_v4_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};

static int __init inet_init(void)
{
    ......
    if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
        pr_crit("%s: Cannot add ICMP protocol\n", __func__);
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        pr_crit("%s: Cannot add UDP protocol\n", __func__);
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);
    ......
    dev_add_pack(&ip_packet_type);
    ......
}

In the udp_protocol structure the handler is udp_rcv, and in the tcp_protocol structure the handler is tcp_v4_rcv. They are registered via inet_add_protocol.

int inet_add_protocol(const struct net_protocol *prot, unsigned char protocol)
{
    if (!prot->netns_ok) {
        pr_err("Protocol %u is not namespace aware, cannot register.\n",
               protocol);
        return -EINVAL;
    }

    return !cmpxchg((const struct net_protocol **)&inet_protos[protocol],
                    NULL, prot) ? 0 : -1;
}

The inet_add_protocol function registers both the TCP and UDP handlers into the inet_protos array. Now look at the dev_add_pack(&ip_packet_type) line: in the ip_packet_type structure, type is the protocol name and func is the ip_rcv function; dev_add_pack registers it into the ptype_base hash table.

//file: net/core/dev.c

void dev_add_pack(struct packet_type *pt)
{
    struct list_head *head = ptype_head(pt);
    ......
}

static inline struct list_head *ptype_head(const struct packet_type *pt)
{
    if (pt->type == htons(ETH_P_ALL))
        return &ptype_all;
    else
        return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}

Remember this: inet_protos stores the addresses of the UDP and TCP handler functions, and ptype_base stores the address of the ip_rcv() handler. Later we will see that the soft interrupt path finds ip_rcv through ptype_base and hands IP packets to it for processing; inside ip_rcv, the TCP or UDP handler is found through inet_protos, and the packet is forwarded to udp_rcv() or tcp_v4_rcv().

As an aside, if you read the code of functions such as ip_rcv and udp_rcv you will see a lot of protocol-level processing. For example, ip_rcv handles netfilter and iptables filtering; if you have many or complex netfilter or iptables rules, they are executed in soft interrupt context, which increases network latency. As another example, udp_rcv checks whether the socket receive queue is full; the related kernel parameters are net.core.rmem_max and net.core.rmem_default. If you are interested, I encourage you to read the code of the inet_init function.

2.4 Initializing the NIC Driver

Each driver (not just NIC drivers) registers an initialization function with the kernel using module_init, and the kernel calls it when the driver is loaded. For the igb NIC driver, the code is in drivers/net/ethernet/intel/igb/igb_main.c:

//file: drivers/net/ethernet/intel/igb/igb_main.c

static struct pci_driver igb_driver = {

    .name     = igb_driver_name,
    .id_table = igb_pci_tbl,
    .probe    = igb_probe,
    .remove   = igb_remove,
    ......

};

static int __init igb_init_module(void)
{
    ......
    ret = pci_register_driver(&igb_driver);
    return ret;
}

After pci_register_driver is called, the Linux kernel knows about the driver, for example the igb driver's igb_driver_name and the address of igb_probe. When a NIC device is recognized, the kernel calls the probe method of its driver (for igb_driver this is igb_probe). The purpose of the probe method is to get the device ready; for the igb card, igb_probe is in drivers/net/ethernet/intel/igb/igb_main.c. The main operations are as follows:

Figure 6 NIC driver initialization

In step 5 we can see that the NIC driver implements the interface required by ethtool and registers its callback addresses here. When the ethtool command issues a system call, the kernel finds the callback for the requested operation. For the igb card, these functions are implemented in drivers/net/ethernet/intel/igb/igb_ethtool.c. Now you can see how ethtool really works: the reason it can display send/receive statistics, change the NIC's adaptive mode, and adjust the number and size of the RX queues is that it ends up invoking the corresponding methods of the NIC driver; the capability lives in the driver, not in ethtool itself.
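To make that concrete, here is a minimal user-space sketch of the kind of request ethtool sends down. It issues the SIOCETHTOOL ioctl with ETHTOOL_GRINGPARAM, roughly what ethtool -g does; the interface name eth0 is an assumption, substitute your own.

#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

int main(void)
{
    /* Ask the driver for its RX/TX ring sizes (like `ethtool -g eth0`). */
    struct ethtool_ringparam ring = { .cmd = ETHTOOL_GRINGPARAM };
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* assumed interface name */
    ifr.ifr_data = (char *)&ring;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("SIOCETHTOOL");
        return 1;
    }

    printf("RX ring: %u (max %u), TX ring: %u (max %u)\n",
           ring.rx_pending, ring.rx_max_pending,
           ring.tx_pending, ring.tx_max_pending);
    return 0;
}

Inside the kernel, the SIOCETHTOOL handler dispatches this request to the driver's registered get_ringparam callback, which is exactly the registration described in step 5 above.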

The igb_netdev_ops registered in step 6 contains functions such as igb_open, which are called when the network card is brought up.

//file: drivers/net/ethernet/intel/igb/igb_main.c

static const struct net_device_ops igb_netdev_ops = {

  .ndo_open               = igb_open,
  .ndo_stop               = igb_close,
  .ndo_start_xmit         = igb_xmit_frame,
  .ndo_get_stats64        = igb_get_stats64,
  .ndo_set_rx_mode        = igb_set_rx_mode,
  .ndo_set_mac_address    = igb_set_mac,
  .ndo_change_mtu         = igb_change_mtu,
  .ndo_do_ioctl           = igb_ioctl,

 ......

In step 7 of the igb_probe initialization, igb_alloc_q_vector is also called. It registers the poll function required by the NAPI mechanism, which for the igb NIC driver is igb_poll, as shown below:

static int igb_alloc_q_vector(struct igb_adapter *adapter,
                  int v_count, int v_idx,
                  int txr_count, int txr_idx,
                  int rxr_count, int rxr_idx)
{
    ......
    /* initialize NAPI */
    netif_napi_add(adapter->netdev, &q_vector->napi,
               igb_poll, 64);
    ......
}

2.5 Starting a NIC

When all the above initialization is complete, the network card can be started. Recall from NIC driver initialization that the driver registered the net_device_ops structure with the kernel; it contains callbacks (function pointers) for enabling the card, transmitting packets, setting the MAC address, and so on. When a network card is enabled (for example, via ifconfig eth0 up), the igb_open method in net_device_ops is called. It usually does the following:

Figure 7 Starting the network adapter

//file: drivers/net/ethernet/intel/igb/igb_main.c
static int __igb_open(struct net_device *netdev, bool resuming){

    /* allocate transmit descriptors */
    err = igb_setup_all_tx_resources(adapter);

    /* allocate receive descriptors */
    err = igb_setup_all_rx_resources(adapter);

    /* Register interrupt handler */
    err = igb_request_irq(adapter);
    if (err)
        goto err_req_irq;

    /* Enable NAPI */
    for (i = 0; i < adapter->num_q_vectors; i++)
        napi_enable(&(adapter->q_vector[i]->napi));
    ......
}

Above, the __igb_open function calls igb_setup_all_tx_resources and igb_setup_all_rx_resources. In the igb_setup_all_rx_resources step, the RingBuffer is allocated and the mapping between memory and the RX queue is established. (The number and size of the RX/TX queues can be configured with ethtool.) Next, let's look at the interrupt registration, igb_request_irq:

static int igb_request_irq(struct igb_adapter *adapter)
{
    if (adapter->msix_entries) {
        err = igb_request_msix(adapter);
        if (!err)
            goto request_done;
        ......
    }
    ......
}

static int igb_request_msix(struct igb_adapter *adapter)
{
    ......
    for (i = 0; i < adapter->num_q_vectors; i++) {
        ......
        err = request_irq(adapter->msix_entries[vector].vector,
                  igb_msix_ring, 0, q_vector->name,
                  ......);
    }
    ......
}

Tracing the calls above, __igb_open => igb_request_irq => igb_request_msix, we can see that for a multi-queue network card an interrupt is registered for each queue, and the corresponding interrupt handler is igb_msix_ring (this function is also in drivers/net/ethernet/intel/igb/igb_main.c). We can also see that in MSI-X mode each RX queue has its own independent interrupt, so at the hardware interrupt level the NIC can be configured so that packets received on different queues are processed by different CPUs. (The CPU binding can be changed via irqbalance or by modifying /proc/irq/irq_number/smp_affinity.)

When the above preparations are complete, you are ready to open the door for your guests (packets)!

3. Welcoming the arrival of data

3.1 Hard Interrupt Processing

First, when a data frame arrives at the network card from the wire, its first stop is the NIC's receive queue. The NIC looks for an available buffer in the RingBuffer allocated to it, and the DMA engine copies the data into memory associated with the NIC; the CPU is unaware of all this. When the DMA operation completes, the NIC raises a hard interrupt to notify the CPU that data has arrived.

Note: when the RingBuffer is full, newly arrived packets are dropped. ifconfig shows this as the overruns counter, which counts packets discarded because the ring queue was full. If you see such packet loss, you may need to use ethtool to lengthen the ring queue. In the section on starting the NIC, we mentioned that the hard interrupt handler registered by the NIC is igb_msix_ring.

//file: drivers/net/ethernet/intel/igb/igb_main.c

static irqreturn_t igb_msix_ring(int irq, void *data){

    struct igb_q_vector *q_vector = data;

    /* Write the ITR value calculated from the previous interrupt. */
    igb_write_itr(q_vector);

    napi_schedule(&q_vector->napi);
    return IRQ_HANDLED;

}

igb_write_itr simply records the hardware interrupt rate (presumably used to reduce the interrupt rate seen by the CPU). Following the call chain napi_schedule => __napi_schedule => ____napi_schedule:

/* Called with irq disabled */

static inline void ____napi_schedule(struct softnet_data *sd,
                                     struct napi_struct *napi)
{
    list_add_tail(&napi->poll_list, &sd->poll_list);
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);

}

Notice that list_add_tail modifies the poll_list in the per-CPU variable softnet_data, appending the poll_list passed in from the driver's napi_struct. The poll_list in softnet_data is a doubly linked list of devices that have input frames waiting to be processed. __raise_softirq_irqoff then triggers the NET_RX_SOFTIRQ soft interrupt. This so-called triggering is nothing more than an OR operation on a variable.

void __raise_softirq_irqoff(unsigned int nr){
    trace_softirq_raise(nr);
    or_softirq_pending(1UL << nr);

}

//file: include/linux/irq_cpustat.h

#define or_softirq_pending(x)  (local_softirq_pending() |= (x))

As we said, Linux does only the minimum necessary work in the hard interrupt; most of the remaining processing is handed over to the soft interrupt. As you can see from the code above, hard interrupt handling really is quite short: it records a register, tweaks the CPU's poll_list a bit, and then raises a soft interrupt. That's it; the hard interrupt's work is done.

3.2 Ksoftirqd kernel threads handle soft interrupts

In the kernel thread initialization section we introduced ksoftirqd's two thread functions: ksoftirqd_should_run and run_ksoftirqd. Let's start with ksoftirqd_should_run:

static int ksoftirqd_should_run(unsigned int cpu){
    return local_softirq_pending();

}

#define local_softirq_pending() \
    __IRQ_STAT(smp_processor_id(), __softirq_pending)

Here we see the same local_softirq_pending function that was used on the hard interrupt side. The difference is that the hard interrupt side writes the flag, while here it is only read. If a hard interrupt has set NET_RX_SOFTIRQ, it can naturally be read here. Next, the thread actually enters run_ksoftirqd:

static void run_ksoftirqd(unsigned int cpu)
{
    local_irq_disable();
    if (local_softirq_pending()) {
        __do_softirq();
        rcu_note_context_switch(cpu);
        local_irq_enable();
        cond_resched();
        return;
    }
    local_irq_enable();
}

In __do_softirq, the registered action method is called according to the type of soft interrupt pending on the current CPU:

asmlinkage void __do_softirq(void)
{
    do {
        if (pending & 1) {
            unsigned int vec_nr = h - softirq_vec;
            int prev_count = preempt_count();
            ......
            trace_softirq_entry(vec_nr);
            h->action(h);
            trace_softirq_exit(vec_nr);
            ......
        }
        h++;
        pending >>= 1;
    } while (pending);
}

In the network subsystem initialization section we saw that the handler registered for NET_RX_SOFTIRQ is net_rx_action, so it is net_rx_action that gets executed here.

Note that both setting the soft interrupt flag in the hard interrupt and ksoftirqd's check for pending soft interrupts are based on smp_processor_id(). This means that whichever CPU responds to the hard interrupt is also the CPU that processes the resulting soft interrupt. So if you find that your Linux soft interrupt CPU consumption is concentrated on one core, the fix is to adjust the CPU affinity of the hard interrupts to spread them across different CPU cores.
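A quick way to see where NET_RX soft interrupts are landing is to look at /proc/softirqs, which shows per-CPU counts for every softirq type. The sketch below is just a small user-space helper of mine that prints the header row and the NET_RX row; it is an illustration, not part of the kernel code discussed here.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* /proc/softirqs lists per-CPU counters for each softirq type,
     * including NET_RX, which is the one raised by the NIC receive path. */
    FILE *f = fopen("/proc/softirqs", "r");
    if (!f) {
        perror("fopen /proc/softirqs");
        return 1;
    }

    char line[1024];
    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, "CPU0") || strstr(line, "NET_RX"))
            fputs(line, stdout);
    }

    fclose(f);
    return 0;
}

If the NET_RX counter grows on only one CPU, adjusting the hard interrupt affinity (the smp_affinity setting mentioned earlier) is the usual first step.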

Now let's focus on the core function, net_rx_action:

static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget;
    void *have;

    local_irq_disable();
    while (!list_empty(&sd->poll_list)) {
        ......
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

        work = 0;
        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
            work = n->poll(n, weight);
            trace_napi_poll(n);
        }

        budget -= work;
    }
}

The time_limit and budget at the beginning of net_rx_action are used to control when the function exits, ensuring that packet reception does not monopolize the CPU; any remaining packets are handled the next time the NIC raises a hard interrupt. budget can be adjusted via a kernel parameter (net.core.netdev_budget). The rest of the core logic is: get the current CPU's softnet_data, iterate over its poll_list, and invoke the poll function that the NIC driver registered. For an igb card, this is the igb driver's igb_poll function.

static int igb_poll(struct napi_struct *napi, int budget)
{
    ......
    if (q_vector->tx.ring)
        clean_complete = igb_clean_tx_irq(q_vector);

    if (q_vector->rx.ring)
        clean_complete &= igb_clean_rx_irq(q_vector, budget);
    ......
}

For the receive path, the main work of igb_poll is the call to igb_clean_rx_irq:

static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
{
    ......
    do {
        /* retrieve a buffer from the ring */
        skb = igb_fetch_rx_buffer(rx_ring, rx_desc, skb);

        /* fetch next buffer in frame if non-eop */
        if (igb_is_non_eop(rx_ring, rx_desc))
            continue;

        /* verify the packet layout is correct */
        if (igb_cleanup_headers(rx_ring, rx_desc, skb)) {
            skb = NULL;
            continue;
        }

        /* populate checksum, timestamp, VLAN, and protocol */
        igb_process_skb_fields(rx_ring, rx_desc, skb);

        napi_gro_receive(&q_vector->napi, skb);
        ......
    } while (likely(total_packets < budget));
    ......
}

igb_fetch_rx_buffer and igb_is_non_eop work together to fetch a data frame from the RingBuffer. Why two functions? Because a frame may span more than one RingBuffer entry, so buffers are fetched in a loop until the end of the frame is reached. A received data frame is represented by an sk_buff. After the data is received, some checks are performed on it, and fields of the skb such as timestamp, VLAN id and protocol are filled in. Next comes napi_gro_receive:

//file: net/core/dev.c

gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb){

    skb_gro_reset_offset(skb);
    return napi_skb_finish(dev_gro_receive(napi, skb), skb);

}

The dev_gro_receive function implements the kernel's GRO (Generic Receive Offload) feature, which can be roughly understood as merging related small packets into one large packet. The purpose is to reduce the number of packets handed to the protocol stack, which helps reduce CPU usage. For now let's skip over GRO and look at napi_skb_finish, which mainly calls netif_receive_skb.

//file: net/core/dev.c

static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
{
    switch (ret) {
    case GRO_NORMAL:
        if (netif_receive_skb(skb))
            ret = GRO_DROP;
        break;
    ......
    }
}

In netif_receive_skb, the packet is handed to the protocol stack. Note that sections 3.3, 3.4 and 3.5 below are still part of soft interrupt processing; they are split out into separate sections only because this one is getting long.

3.3 Network protocol stack processing

Depending on the packet's protocol, netif_receive_skb delivers it to the registered protocol handlers; for a UDP packet, that means ip_rcv() and then udp_rcv().

Figure 10 network protocol stack processing

//file: net/core/dev.c

int netif_receive_skb(struct sk_buff *skb)
{
    // RPS processing logic; ignore it for now
    ......
    return __netif_receive_skb(skb);
}

static int __netif_receive_skb(struct sk_buff *skb)
{
    ......
    ret = __netif_receive_skb_core(skb, false);
}

static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
{
    ......

    // pcap logic; this is where the data is delivered to the capture point
    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
    ......

    list_for_each_entry_rcu(ptype,
            &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
    ......
}

In __netif_receive_skb_core we can see the capture point used by the familiar tcpdump, which was quite exciting to find; reading the source code was worth the time. __netif_receive_skb_core then takes the protocol from the packet and iterates over the list of callback functions registered for that protocol. ptype_base is a hash table, as mentioned in the protocol registration section, and the address of the ip_rcv function is stored in it.

//file: net/core/dev.c

static inline int deliver_skb(struct sk_buff *skb,
                              struct packet_type *pt_prev,
                              struct net_device *orig_dev)
{
    ......
    return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

The pt_prev->func line calls the handler registered for the protocol. For IP packets this is ip_rcv (for ARP packets it would be arp_rcv).
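As an aside on the ptype_all capture point mentioned above: a packet socket opened with ETH_P_ALL registers a packet_type in that very list, which is how tcpdump-style tools get a copy of every frame in this loop. The following is only a bare-bones sketch; real capture tools such as libpcap use a more elaborate AF_PACKET setup with memory-mapped rings, and this needs CAP_NET_RAW (root) to run.

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* An ETH_P_ALL packet socket hooks into the ptype_all delivery loop,
     * so it receives a copy of every frame, link-layer header included. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket(AF_PACKET)");
        return 1;
    }

    unsigned char frame[2048];
    ssize_t n = recv(fd, frame, sizeof(frame), 0);   /* one raw L2 frame */
    if (n >= 14)
        printf("captured %zd bytes, ethertype 0x%02x%02x\n",
               n, frame[12], frame[13]);

    close(fd);
    return 0;
}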

3.4 IP protocol Layer processing

Let's take a look at what Linux does at the IP protocol layer and how the packet is passed further up to the UDP or TCP handler.

//file: net/ipv4/ip_input.c

int ip_rcv(struct sk_buff *skb, struct net_device *dev,
           struct packet_type *pt, struct net_device *orig_dev)
{
    ......
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
                   ip_rcv_finish);
}

NF_HOOK is a hook function; after the registered netfilter hooks have run, execution continues to ip_rcv_finish.

static int ip_rcv_finish(struct sk_buff *skb)
{
    ......
    if (!skb_dst(skb)) {
        int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
                                       iph->tos, skb->dev);
        ......
    }
    ......
    return dst_input(skb);
}

Trace ip_route_input_noref and see that it calls ip_route_input_mc. In ip_route_input_mc, the function ip_local_deliver is assigned to dst.input as follows:

//file: net/ipv4/route.c

static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
                             u8 tos, struct net_device *dev, int our)
{
    if (our) {
        rth->dst.input = ip_local_deliver;
        rth->rt_flags |= RTCF_LOCAL;
    }
}

So back in ip_rcv_finish, the return dst_input(skb) ends up here:

/* Input packet from network to transport. */
static inline int dst_input(struct sk_buff *skb)
{
    return skb_dst(skb)->input(skb);
}

The input method invoked via skb_dst(skb)->input is the ip_local_deliver that the routing subsystem assigned.

//file: net/ipv4/ip_input.c

int ip_local_deliver(struct sk_buff *skb){

    /* * Reassemble IP fragments. */
    if (ip_is_fragment(ip_hdr(skb))) {
        if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
            return 0;
    }

    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
               ip_local_deliver_finish);

}

static int ip_local_deliver_finish(struct sk_buff *skb)
{
    ......
    int protocol = ip_hdr(skb)->protocol;
    const struct net_protocol *ipprot;

    ipprot = rcu_dereference(inet_protos[protocol]);
    if (ipprot != NULL) {
        ret = ipprot->handler(skb);
    }
}

As we saw in the protocol registration section, inet_protos stores the addresses of tcp_v4_rcv() and udp_rcv(). Here the handler is selected according to the protocol field in the packet, and the skb is dispatched further up to the transport layer protocol, UDP or TCP.

3.5 UDP Layer Processing

As mentioned in the protocol registration section, the UDP protocol handler is udp_rcv:

//file: net/ipv4/udp.c

int udp_rcv(struct sk_buff *skb){

    return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);

}
int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
           int proto)
{
    ......
    sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);

    if (sk != NULL) {
        int ret = udp_queue_rcv_skb(sk, skb);
        ......
    }
    ......
    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
    ......
}

__udp4_lib_lookup_skb looks up the corresponding socket for the skb; if one is found, the packet is placed on that socket's receive buffer queue. If not, an ICMP destination-unreachable message is sent back.

//file: net/ipv4/udp.c

int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
    ......
    if (sk_rcvqueues_full(sk, skb, sk->sk_rcvbuf))
        goto drop;

    rc = 0;

    ipv4_pktinfo_prepare(skb);
    bh_lock_sock(sk);
    if (!sock_owned_by_user(sk))
        rc = __udp_queue_rcv_skb(sk, skb);
    else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
        bh_unlock_sock(sk);
        goto drop;
    }
    bh_unlock_sock(sk);
    return rc;
}

sock_owned_by_user checks whether a user process is currently making a system call on this socket (i.e. the socket is locked). If not, the packet is placed directly on the socket's receive queue. If so, the packet is added to the backlog queue via sk_add_backlog; when the user process releases the socket, the kernel checks the backlog queue and, if there is data, moves it onto the receive queue.

In the sk_rcvqueues_full check, if the receive queue is full, the packet is simply dropped. The receive queue size is influenced by the kernel parameters net.core.rmem_max and net.core.rmem_default.
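If drops show up at this check, one knob besides the sysctls is the socket's own receive buffer. The sketch below is my own illustration, not part of the kernel source above: it requests a larger buffer with SO_RCVBUF and reads back what the kernel actually granted, since the grant is doubled internally and capped by net.core.rmem_max.

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    int want = 4 * 1024 * 1024;            /* ask for a 4 MB receive buffer */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want)) < 0)
        perror("setsockopt(SO_RCVBUF)");

    int got = 0;
    socklen_t len = sizeof(got);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);

    /* The kernel doubles the requested value and caps it at rmem_max. */
    printf("effective SO_RCVBUF: %d bytes\n", got);
    return 0;
}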

4. The recvfrom system call

So much for one branch of the story; now let's pick up the other. The kernel side has received and processed the packet and finally placed it on the socket's receive queue. Now let's go back to what happens when the user process calls recvfrom. The recvfrom we call in our code is a glibc library function; when executed, it traps the user process into kernel mode and enters the kernel's implementation of the system call, sys_recvfrom. Before digging into sys_recvfrom, let's take a quick look at the core socket data structure. The structure is large, so we will only draw the parts relevant to today's topic:

Figure 11 The socket kernel data structure

The const struct proto_ops member of the socket structure corresponds to the protocol's method set. Each protocol implements its own set of methods; for the IPv4 Internet protocol family, each protocol has a corresponding method set. For UDP it is inet_dgram_ops, in which the inet_recvmsg method is registered.

//file: net/ipv4/af_inet.c

const struct proto_ops inet_stream_ops = {
    ......
    .recvmsg       = inet_recvmsg,
    .mmap          = sock_no_mmap,
    ......
};

const struct proto_ops inet_dgram_ops = {
    ......
    .sendmsg       = inet_sendmsg,
    .recvmsg       = inet_recvmsg,
    ......
};

Another member of the socket structure, struct sock *sk, is a very large and very important substructure. Its sk_prot field in turn defines a second-level set of handler functions. For UDP it points to the UDP implementation set udp_prot.

//file: net/ipv4/udp.c

struct proto udp_prot = {

    .name          = "UDP",
    .owner         = THIS_MODULE,
    .close         = udp_lib_close,
    .connect       = ip4_datagram_connect,
    ......
    .sendmsg       = udp_sendmsg,
    .recvmsg       = udp_recvmsg,
    .sendpage      = udp_sendpage,
    ......

}

Having looked at the socket data structure, let's look at the implementation of sys_recvfrom.

In inet_recvmsg, sk->sk_prot->recvmsg is called:

//file: net/ipv4/af_inet.c

int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
                 size_t size, int flags)
{
    ......
    err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
                               flags & ~MSG_DONTWAIT, &addr_len);
    if (err >= 0)
        msg->msg_namelen = addr_len;
    return err;
}

As we said above, for UDP this sk_prot is the struct proto udp_prot defined in net/ipv4/udp.c, which leads us into the udp_recvmsg method. udp_recvmsg reads from the socket's receive queue via __skb_recv_datagram:

//file: net/core/datagram.c

struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
                                    int *peeked, int *off, int *err)
{
    ......
    do {
        struct sk_buff_head *queue = &sk->sk_receive_queue;
        skb_queue_walk(queue, skb) {
            ......
        }

        /* User doesn't want to wait */
        error = -EAGAIN;
        if (!timeo)
            goto no_packet;
    } while (!wait_for_more_packets(sk, err, &timeo, last));
    ......
}

Finally we have found what we were looking for: the so-called reading is simply reading from sk->sk_receive_queue. If there is no data there and the user is allowed to wait, wait_for_more_packets() is called, which puts the user process to sleep.
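The "/* User doesn't want to wait */" branch above is exactly what a non-blocking read hits. A minimal user-space sketch of both behaviors, assuming a UDP socket bound to the hypothetical port 8888, might look like this:

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port   = htons(8888),                    /* hypothetical port */
        .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
    };
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    char buf[1024];

    /* Non-blocking: if sk_receive_queue is empty, the kernel returns
     * -EAGAIN instead of calling wait_for_more_packets(). */
    ssize_t n = recvfrom(fd, buf, sizeof(buf), MSG_DONTWAIT, NULL, NULL);
    if (n < 0 && errno == EAGAIN)
        printf("receive queue empty, not waiting\n");

    /* Blocking: the process is put to sleep until a packet is queued. */
    n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
    if (n >= 0)
        printf("received %zd bytes\n", n);

    return 0;
}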

5. Summary

The networking code is among the most complex parts of the Linux kernel. What looks like a simple receive path involves interaction between many kernel components: the NIC driver, the protocol stack, the ksoftirqd kernel threads, and so on. Complex as it is, this article has tried to illustrate the kernel's receive path in an accessible way with diagrams. Let's walk through the whole receive process once more.

When the user issues a recvfrom call, the user process enters kernel mode via the system call. If there is no data in the receive queue, the process goes to sleep and is suspended by the operating system. That part is relatively simple; most of the remaining work is done by other parts of the Linux kernel.

First, Linux does a lot of work before it can start receiving packets:

  1. Create the ksoftirqd kernel threads and set up their thread functions, so that they are ready to handle soft interrupts

  2. Register the protocol stack: Linux implements many protocols such as ARP, ICMP, IP, UDP and TCP, and each protocol registers its own handler function so that a packet can quickly find the function that will process it

  3. Initialize the NIC driver: every driver has an initialization function that the kernel calls; during this step the DMA is prepared and the address of the NAPI poll function is made known to the kernel

  4. Start the network card: allocate the RX and TX queues and register the interrupt handlers

All of this must be done before the kernel is ready to receive packets. Only when the above is ready can hard interrupts be enabled and packets awaited.

When the data arrives, the first thing that greets it is a network card:

  1. The NIC DMAs the data frame into the RingBuffer in memory, then raises an interrupt to notify the CPU that data has arrived

  2. In response to the interrupt request, the CPU calls the interrupt handler that was registered when the NIC was started

  3. The interrupt handler does almost nothing except raise a soft interrupt request

  4. The ksoftirqd kernel thread detects the soft interrupt request and first disables hard interrupts

  5. The ksoftirqd thread then calls the driver's poll function to receive packets

  6. The poll function hands the received packets to the ip_rcv function registered in the protocol stack

  7. The ip_rcv function passes the packets on to the udp_rcv function (for TCP packets, to tcp_v4_rcv)

Now we can return to the question at the beginning: behind the single recvfrom line we write at the user level, the Linux kernel does an enormous amount of work before the data reaches us. And this is just plain UDP; for TCP the kernel has even more to do. One cannot help but marvel at how much thought the kernel developers have put in.

Understanding the whole receive path also makes clear where Linux spends CPU on receiving a packet: first, the overhead of the user process's system call trapping into kernel mode; second, the CPU cost of the hard interrupt that responds to the packet; and third, the cost spent in the soft interrupt context of the ksoftirqd kernel thread. We will post another article that looks at these costs in practice.

There are also many details of network send and receive that we have not covered, such as the non-NAPI path, GRO, RPS, and so on. Going too deep would get in the way of understanding the overall flow, so we have tried to stick to the main skeleton: less is more!