The socket and TCP connection process: what you need to know

Categories: Linux miscellaneous, Linux Services, Web architecture, General knowledge of computer languages


This article describes the TCP connection process and the socket operations at each stage. I hope it helps readers with no network-programming background understand what a socket is and what it does. Please point out any errors you find.

Background

1. The TCP stack maintains two socket buffers: the send buffer and the recv buffer.

The data to be sent over the TCP connection is first copied into the send buffer, either from the app buffer of a user-space process or from a kernel buffer. The copy is performed by the send() function. Since data can also be written with the write() function, this step is also called writing data, and the send buffer is accordingly also called the write buffer. Note that send() differs from write() only in accepting socket-specific flags; with no flags set, the two calls are equivalent.

The data in the send buffer then has to be copied to the network adapter. Because one end is memory and the other is the network card device, the copy can be done directly by DMA without involving the CPU. In other words, the data in the send buffer is DMA-copied to the network card and transmitted over the network to the other end of the TCP connection: the receiver.

When receiving data over a TCP connection, the data first flows in through the network card and is then DMA-copied into the recv buffer. The recv() function then copies the data from the recv buffer into the app buffer of the user-space process.

The general process is shown as follows:
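As a small aside, the kernel-side send and recv buffer sizes can be inspected per socket. Here is a minimal sketch, assuming a freshly created TCP socket; SO_SNDBUF and SO_RCVBUF are standard socket options:

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int sndbuf, rcvbuf;
    socklen_t len = sizeof(int);

    /* Ask the kernel how large this socket's send/recv buffers are. */
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);
    len = sizeof(int);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);

    printf("send buffer: %d bytes, recv buffer: %d bytes\n", sndbuf, rcvbuf);
    return 0;
}
```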

2. Two types of sockets: listening socket and connected socket.

A listening socket is created by the socket() function after the server process reads its configuration file and resolves the address and port to listen on. bind() then binds the listening socket to that address and port. Finally, the process/thread can listen on the port (strictly speaking, listen on the listening socket) through the listen() function.

A connected socket is the socket returned by accept() after a TCP connection request has been received and the three-way handshake completed. Subsequent processes/threads can use this connected socket for TCP communication with the client.

To distinguish the two socket descriptors returned by socket() and accept(), they are often called listenfd and connfd for the listening and connected sockets respectively; these names are fairly descriptive and are used occasionally in the following sections.

The following sections illustrate the role of each of these functions and analyze how they participate in the processes of connecting and disconnecting.

A detailed analysis of the connection process

The diagram below:

The socket() function

The socket() function creates a socket file descriptor, sockfd ("socket() creates an endpoint for communication and returns a descriptor," as the man page puts it). This socket descriptor can later be used as the object to which bind() binds an address.

The bind() function

After parsing its configuration file and resolving the address and port it wants to listen on, the server can use bind() to bind the socket sockfd created by socket() to that address and port combination, "addr:port". A socket bound to an address and port can then be listened on by the listen() function.
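Here is a minimal sketch of the socket()+bind() sequence, assuming the server wants to listen on 0.0.0.0:8080 (both placeholders):

```c
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    /* Create the socket; together with bind() below, this fixes the
     * protocol, source address, and source port of future connections. */
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* 0.0.0.0 */
    addr.sin_port = htons(8080);

    if (bind(sockfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }
    return 0;
}
```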

A socket bound to an address and port has a source address and a source port ("source" from the server's own point of view). Together with the protocol type specified in the configuration file, this fixes three elements of the connection quintuple. That is:

```
{protocol, src_addr, src_port}
```

However, it is common for a service to be configured to listen on multiple addresses and ports, i.e., to run multiple instances. This is essentially done by issuing multiple socket()+bind() system calls to create and bind multiple sockets.

The listen() and connect() functions

As the name implies, listen() listens on the socket that bind() has bound to addr+port. After listen() is called, the socket changes from the CLOSED state to the LISTEN state, and the socket now provides the entry point for TCP connections.

The connect() function is used to initiate a connection request to a listening socket; this begins the TCP three-way handshake. As this suggests, connect() is called by the connection requester (e.g., the client). Of course, the initiator also needs to create a sockfd of its own before it can call connect(), typically a socket bound to a random port. Since connect() initiates a connection toward a socket, it naturally needs the destination of the connection: the destination address and port, which are exactly the address and port bound to the server's listening socket. The request also carries the client's own address and port, which from the server's point of view are the source address and source port of the connection. Thus the sockets at both ends of the TCP connection end up with the complete quintuple.
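A minimal client-side sketch, assuming the server above is listening on 127.0.0.1:8080 (both placeholders); the kernel picks the client's random source port automatically:

```c
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(8080);                     /* destination port */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr); /* destination address */

    /* connect() triggers the three-way handshake; the kernel fills in
     * the source address and a random source port for us. */
    if (connect(sockfd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        return 1;
    }
    return 0;
}
```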

A deeper analysis of listen()

Let's talk more about the listen() function. When a process/thread listens, it typically monitors the listening socket descriptor with select() or poll(); only this one socket descriptor is of interest to select() or poll().

Whether select() or poll() is used (epoll's different monitoring mechanism needs no extra discussion here), the process/thread (the listener) blocks on select()/poll() while listening. When a SYN message arrives on the monitored sockfd (written into the recv buffer), the kernel is woken up — note: not the app process, because the TCP three-way handshake and four-way close are completed in kernel space by the kernel, with no user-space involvement. The kernel copies the SYN data into a kernel buffer, processes it (for example, checks whether the SYN is reasonable), and prepares the SYN+ACK data, which is copied from the kernel buffer to the send buffer and then to the network card to be sent. At the same time, a new entry for this connection is created in the syn queue and set to the SYN_RECV state.

Then select()/poll() monitors the listening socket listenfd again, until new data is written to it and the kernel is woken once more. If the data written this time is an ACK, it means a client is responding to a SYN+ACK previously sent by the server's kernel: the data is copied into the kernel buffer for processing, and the corresponding entry in the incomplete connection queue is moved to the completed connection queue (the accept queue, or established queue) with its state set to ESTABLISHED. If what arrives this time is not an ACK but a SYN, it is a new connection request, which is placed into the incomplete connection queue just as described above.

Connections placed in the completed queue wait to be consumed by the accept() function (accept() is invoked by the user-space process). Once a connection has been accept()ed, it is removed from the completed queue, meaning the TCP connection is fully set up. The user-space processes at both ends can now transfer data over the connection until it is closed by close() or shutdown() and the four-way close, with the kernel's handshake machinery uninvolved in between. This is the loop through which the listener handles entire TCP connections.
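A minimal sketch of this monitoring loop, assuming a blocking listenfd that has already gone through socket()+bind()+listen(); the handshake itself happens in the kernel, so user space only ever sees listenfd become readable:

```c
#include <stdio.h>
#include <sys/select.h>
#include <sys/socket.h>

/* listenfd: a socket already bound and in the LISTEN state. */
void listener_loop(int listenfd) {
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(listenfd, &rfds);

        /* Block until the kernel signals that the completed-connection
         * (accept) queue has an entry ready for us. */
        if (select(listenfd + 1, &rfds, NULL, NULL, NULL) < 0) {
            perror("select");
            return;
        }

        /* accept() pops one ESTABLISHED connection off the accept queue
         * and returns a new connected socket descriptor. */
        int connfd = accept(listenfd, NULL, NULL);
        if (connfd < 0) {
            perror("accept");
            continue;
        }
        /* ... hand connfd to a worker here ... */
    }
}
```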

To summarize, listen() maintains two queues: the syn queue (incomplete connection queue) and the accept queue (completed connection queue). When the listener receives a SYN from a client and replies with SYN+ACK, an entry about the client is created at the end of the incomplete connection queue and its state set to SYN_RECV. Obviously, this entry must contain the client's address and port information (possibly hashed; I'm not sure). When the server later receives the client's ACK, the kernel analyzes the data to determine which entry in the incomplete queue the message belongs to, moves that entry to the completed connection queue, and sets its state to ESTABLISHED. Finally, it waits for the user process to consume the connection with accept(). From then on, the kernel's handshake machinery temporarily exits the stage, until the four-way close.

An analogy makes the two queues easy to understand: the line to check tickets at the gate is the incomplete queue; after checking in, you enter the waiting hall, and the number of seats in the waiting hall is the length of the completed queue; when the bus arrives and takes passengers away from the waiting hall, that is accept() consuming from the completed queue.

When the incomplete connection queue is full, the listener blocks on receiving new connection requests and waits via select()/poll() for the queues to trigger writable events. When the completed connection queue is full, the listener does not receive new connection requests, and the action of moving entries into the completed queue is blocked. Before Linux 2.2, the backlog parameter of listen() set the maximum total length of the two queues (actually there is only one queue, with entries in two states; see the "fact" below). Starting with Linux 2.2, this parameter represents only the maximum length of the accept queue (that is, the number of seats in the waiting hall), while /proc/sys/net/ipv4/tcp_max_syn_backlog sets the maximum length of the syn queue (the SYN backlog). /proc/sys/net/core/somaxconn is a hard limit on the maximum length of the completed queue, 128 by default; if the backlog parameter is greater than somaxconn, the backlog is truncated to that hard limit. In other words, the maximum length of the completed queue is min(backlog, somaxconn). A more complete introduction can be found at www.programmersought.com/article/384… .
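A minimal sketch of the call itself; the backlog value here is a placeholder, and the kernel silently caps it at somaxconn:

```c
#include <stdio.h>
#include <sys/socket.h>

/* sockfd: a socket already bound with bind(). */
int start_listening(int sockfd) {
    /* The effective accept-queue length will be min(511, somaxconn). */
    if (listen(sockfd, 511) < 0) {
        perror("listen");
        return -1;
    }
    return 0;
}
```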

A program can thus set the number of seats in the waiting hall through listen(). However, listen() cannot set the length of the check-in line, i.e., the incomplete queue. Still, the station has hard rules: the waiting hall may contain at most N seats, and at most N people may line up to check in (imagine the station's space is limited, so the line can only stretch to the station door). These hard rules are set by modifying kernel parameters.

When a connection in the completed queue is accept()ed, the TCP connection is established, and the connection uses its own socket buffers to transfer data with the client. Both the connected socket's buffers and the listening socket's buffers store TCP incoming and outgoing data, but their meanings differ. The listening socket's buffers accept only the SYN and ACK data of TCP connection requests, whereas the buffers of an established TCP connection mainly store the "real" data transmitted between the two ends, such as response data built by the server and HTTP request data sent by the client.

Fact: Two types of TCP sockets

There are actually two different implementations of TCP sockets. The two-queue type described above is the one Linux has used since 2.2. The other (BSD-derived) socket type uses a single queue that holds all connections during the three-way handshake, but each connection in the queue is in one of two states: SYN-RECV or ESTABLISHED.

Description of Recv-Q and Send-Q

The Send-Q and Recv-Q columns of the netstat command represent the state of the socket buffers. From the netstat man page:

```
Recv-Q
    Established: The count of bytes not copied by the user program
    connected to this socket.
    Listening: Since Kernel 2.6.18 this column contains the current
    syn backlog.
Send-Q
    Established: The count of bytes not acknowledged by the remote host.
    Listening: Since Kernel 2.6.18 this column contains the maximum
    size of the syn backlog.
```

For a listening socket, Recv-Q represents the current syn backlog, that is, the number of accumulated SYN messages — the number of connections in the incomplete queue — while Send-Q represents the maximum syn backlog, that is, the maximum length of the incomplete queue. For an established TCP connection, the Recv-Q column indicates the amount of data in the recv buffer not yet copied by the user process, and the Send-Q column indicates the amount of sent data for which the remote host has not yet returned an ACK.

The difference between a socket with an established TCP connection and a socket in the listening state is that the two kinds of sockets use their buffers differently: the listening socket is concerned with queue lengths, while the established socket is concerned with the amount of data received and sent.

```
[root@xuexi ~]# netstat -tnl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address     Foreign Address   State
tcp        0      0 0.0.0.0:22        0.0.0.0:*         LISTEN
tcp        0      0 127.0.0.1:25      0.0.0.0:*         LISTEN
tcp6       0      0 :::80             :::*              LISTEN
tcp6       0      0 :::22             :::*              LISTEN
tcp6       0      0 ::1:25            :::*              LISTEN

[root@xuexi ~]# ss -tnl
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN 0      128    *:22                *:*
LISTEN 0      100    127.0.0.1:25        *:*
LISTEN 0      128    :::80               :::*
LISTEN 0      128    :::22               :::*
LISTEN 0      100    ::1:25              :::*
```

Note that for LISTEN sockets, netstat's Send-Q/Recv-Q differ from ss's: netstat does not show the maximum length of the completed queue at all, while ss's Recv-Q and Send-Q show the current and maximum lengths of the completed connection queue (connections waiting for accept()), respectively. In other words, the netstat and ss columns have roughly opposite meanings for listeners: one describes the incomplete queue, the other the completed queue.

Impact of SYN flood attacks

In addition, if the listener does not receive an ACK from the client after sending a SYN+ACK, it is woken up by the timeout set via select()/poll() and resends the SYN+ACK to the client, in case the first one was lost in the network. But if the client called connect() with a forged source address, the listener's SYN+ACK can never reach the real host, so the listener keeps being woken by the timeout and keeps resending. Whether it is the repeated wake-ups from the select()/poll() timeout or the repeated copies of data into the send buffer, the CPU must participate; only the copy of the SYN+ACK from the send buffer to the network card is a DMA copy that needs no CPU. If the client is an attacker sending SYN packets by the thousands or tens of thousands, the listener is overwhelmed almost immediately and the network card becomes heavily congested. This is called a SYN flood attack.

There are several ways to mitigate it: for example, reduce the maximum lengths of the two queues maintained by listen(), reduce the number of SYN+ACK retransmissions, increase the retransmission interval, reduce the timeout for waiting for the ACK, or use syncookies. But none of these direct changes to TCP options balances performance and protection well, so filtering packets before they reach the listener thread is extremely important.

The accept() function

The accept() function reads the first entry from the completed connection queue (removing it from the queue as it reads) and generates a socket descriptor for the subsequent connection — call it connfd. With this new connected socket, the worker process/thread (call it the worker) can communicate with the client, while the listening socket (sockfd) remains under the listener's watch.

For example, in HTTPD's prefork mode, each child process is both listener and worker. When a client sends a connection request, one child process receives it while listening and then releases the listening socket so that other child processes can listen on it. After many rounds of this, accept() finally generates a new connected socket, through which the child process can focus on interacting with the client — though it may still block or sleep several times due to various I/O waits. This is quite inefficient: from the moment the child process receives the SYN to the moment the new connected socket is created, it blocks over and over. Of course, the listening socket can be set to non-blocking I/O mode, but even then the process must constantly check its state.

Consider instead the worker/event processing model, where each child process uses one dedicated listener thread and N worker threads. The listener thread is responsible solely for listening and creating new connected socket descriptors, which it places in Apache's socket queue. Listener and worker are thus separated, and workers can keep working freely while listening continues. In terms of listening alone, worker/event mode performs significantly better than prefork mode.

When the listener makes the accept() system call and the completed connection queue is empty, the listener blocks. Of course, the socket can be set to non-blocking mode, in which case accept() returns the error EWOULDBLOCK or EAGAIN when no connection is available. select(), poll(), or epoll can be used to wait for readable events on the completed connection queue; the socket can also be put into signal-driven I/O mode, so that new entries in the completed connection queue notify the listener, which then calls accept() to copy the connection to the app and process it.
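A minimal sketch of a non-blocking accept(), with listenfd switched to non-blocking mode via fcntl(); EWOULDBLOCK/EAGAIN simply means the completed queue is empty right now:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>

/* listenfd: a listening socket. Returns connfd, or -1 if no connection. */
int try_accept(int listenfd) {
    /* Switch the listening socket to non-blocking mode. */
    fcntl(listenfd, F_SETFL, fcntl(listenfd, F_GETFL, 0) | O_NONBLOCK);

    int connfd = accept(listenfd, NULL, NULL);
    if (connfd < 0) {
        if (errno == EWOULDBLOCK || errno == EAGAIN)
            return -1;  /* completed connection queue is currently empty */
        perror("accept");
        return -1;
    }
    return connfd;  /* a new connected socket */
}
```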

We often hear about synchronous and asynchronous connection handling; how do they differ? "Synchronous" means that from the moment the listener sees the client's SYN until the connected socket is established and all data interaction with that client is finished, no other client connection requests are received until the connection with this client is closed. Put another way, during synchronous handling the socket buffer and app buffer data stay in step for one client at a time. Typically, listener and worker are the same process when handling is synchronous, as in HTTPD's prefork model. "Asynchronous" means that other connection requests can be received and processed at any stage of connection establishment and data interaction. Usually, listener and worker are different processes/threads when handling is asynchronous, as in HTTPD's event model. Note that although listener and worker are separated in the worker model, it still handles connections synchronously: after the listener accepts a connection request and the connected socket is created, it hands the socket to a worker thread, and that worker thread alone serves the connection until the client disconnects. The event model is asynchronous only for special connections (for example, connections in a keep-alive state), which a worker thread can entrust back to the listening thread; ordinary connections are handled just as in the synchronous case. So HTTPD's event model is asynchronous, but only pseudo-asynchronous. Colloquially and loosely: a synchronous connection is one process/thread handling one connection; an asynchronous connection is one process/thread handling multiple connections.

The relationship between TCP connections and sockets

To be clear, each end of a TCP connection is associated with a socket and with the file descriptor that points to that socket.

As mentioned earlier, when the server receives the client's ACK, the three-way handshake is complete and the TCP connection with the client is established. Once established, the connection is placed into the established queue opened by listen(), waiting for accept() to consume it. At this point, the socket the TCP connection is associated with on the server side is still the listening socket and the file descriptor it points to.

When a TCP connection in the established queue is consumed by accept(), the connection becomes associated with the socket created by accept() and is assigned a new file descriptor. That is, after accept(), the connection no longer has anything to do with the listening socket.

In other words, it is still the same connection, but the server has quietly swapped out the socket and file descriptor associated with it, without the client knowing. This does not affect communication between the two sides, because data transfer is based on the connection, not the socket: as long as data can be pushed from a file descriptor into the "pipe" of the TCP connection, it will reach the other end.

In fact, accept() is not strictly required for TCP communication to exist, because the connection is already established before accept(); it is just still associated with the listening socket's file descriptor, which only handles the data involved in the three-way handshake and four-way close — data that is the operating system kernel's responsibility. Imagine calling listen() but never accept(): clients can keep connect()ing successfully, and the server does nothing with those connections until the listen queues fill up.

The send() and recv() functions

The send() function copies data from the app buffer into the send buffer (the data can, of course, also come directly from a kernel buffer), and the recv() function copies data from the recv buffer into the app buffer. For TCP sockets, the generic write() and read() functions can equally well be used to write and read socket buffer data, but send()/recv() have more specific, descriptive names.

Both functions involve the socket buffers, and whenever send() or recv() is called, the question is whether the source buffer has data to copy and whether the destination buffer has room or is full and unwritable. If the conditions are not met, a process/thread calling send()/recv() blocks (assuming the socket uses the blocking I/O model). With the socket in non-blocking mode, calling send()/recv() when the buffer conditions are not satisfied makes the call return immediately with the error EWOULDBLOCK or EAGAIN. select()/poll()/epoll can be used to monitor the file descriptors (really, to monitor their socket buffers), so that send()/recv() are only called when they can make progress. The socket can also be set to a signal-driven or asynchronous I/O model, so that nothing needs to be done until the data is ready (and, for asynchronous I/O, already copied).
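A minimal echo-style sketch on a connected socket connfd, handling the error cases described above (the buffer size is a placeholder):

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* connfd: a connected socket. Echo one chunk of data back to the peer. */
void echo_once(int connfd) {
    char buf[4096];

    /* recv() copies from the kernel recv buffer into our app buffer. */
    ssize_t n = recv(connfd, buf, sizeof(buf), 0);
    if (n < 0) {
        if (errno == EWOULDBLOCK || errno == EAGAIN)
            return;  /* non-blocking socket: recv buffer is empty right now */
        perror("recv");
        return;
    }
    if (n == 0)
        return;  /* peer closed its end of the connection */

    /* send() copies from our app buffer into the kernel send buffer. */
    if (send(connfd, buf, (size_t)n, 0) < 0)
        perror("send");
}
```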

The close() and shutdown() functions

The generic close() function can close a file descriptor, including connection-oriented network socket descriptors. When close() is called, the kernel attempts to send any data remaining in the send buffer. But close() merely decrements the socket's reference count by one, much as rm removes only one hard link when deleting a file. Only when all references to the socket are gone is the socket descriptor really closed and the subsequent four-way close begun. For concurrent server programs in which parent and child share a socket, a child calling close() does not really close the socket, because the parent's copy is still open; if the parent never calls close(), the socket stays open and never enters the four-way close.

Unlike close(), which decrements the reference count, shutdown() cuts off the socket's connection directly, triggering the four-way close. Three shutdown modes can be specified:

1. Shut down writing. No more data can be written to the send buffer; data already in the send buffer is sent until it is all transmitted.
2. Shut down reading. Data can no longer be read from the recv buffer; data already in the recv buffer can only be discarded.
3. Shut down both reading and writing. Data can be neither read nor written; data already in the send buffer is sent until finished, while data already in the recv buffer is discarded.

Whether it is shutdown() or the final close(), a FIN is sent as part of the actual four-way close.
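A minimal sketch of the write-shutdown mode on a connected socket connfd; SHUT_RD, SHUT_WR, and SHUT_RDWR are the standard constants for the three modes:

```c
#include <stdio.h>
#include <sys/socket.h>

/* connfd: a connected socket we have finished writing to. */
void finish_writing(int connfd) {
    /* Half-close: flush the send buffer and send a FIN, while the
     * read direction stays open so we can still read the peer's reply. */
    if (shutdown(connfd, SHUT_WR) < 0)
        perror("shutdown");
}
```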

Address/port reuse techniques

Normally, one addr+port can be bound to only one socket. In other words, addr+port cannot be reused; different sockets can only be bound to different addr+port combinations. For example, to start two sshd instances, the two instances' configuration files must not specify the same addr+port. Likewise, when configuring web virtual hosts, two virtual hosts must not be given the same addr+port — unless they are name-based (domain-based) virtual hosts. Name-based virtual hosts can share the same addr+port because the HTTP request message carries the host name: such connection requests are still received through the same listening socket, but once accepted, the HTTPD worker process/thread can dispatch the connection to the corresponding virtual host.

That is the normal case; the abnormal case is address reuse and port reuse, together called socket reuse. In current Linux kernels, the socket option SO_REUSEADDR supports address reuse and the socket option SO_REUSEPORT supports port reuse. With the reuse options set, binding another socket to an already-bound addr+port no longer produces an error. Moreover, once a service has bound multiple sockets to the same addr+port (two, for example), two listener processes/threads can listen on them simultaneously, and incoming client connections are distributed between them by a round-robin balancing algorithm.
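A minimal sketch, assuming each listener process runs this to bind its own socket to the same 0.0.0.0:8080 (placeholders); SO_REUSEPORT must be set before bind(), and requires Linux 3.9 or later:

```c
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int make_reuseport_listener(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;

    /* Must be set before bind(), or bind() fails with EADDRINUSE. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 128) < 0) {
        perror("bind/listen");
        return -1;
    }
    return fd;  /* each process gets its own listening bucket */
}
```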

For listening processes/threads, each socket reused this way is called a listener bucket; that is, each listening socket is one listening bucket.

Take HTTPD's worker or event model as an example, and assume there are currently three child processes, each with one listener thread and N worker threads.

Then, without address reuse, the listener threads scramble for the right to listen. At any moment, only one listener thread can listen on the socket (by acquiring a mutex). When that listener thread receives a request, it gives up its listening status, and the other listener threads fight over it — and again only one can win. The diagram below:

With address reuse and port reuse, multiple sockets can be bound to the same addr+port. In the figure below, for example, one more listening bucket means two sockets, so two listener threads can listen at the same time; when one receives a request and gives up its slot, the remaining listener threads compete for it.

If one more socket is bound, the three listener threads no longer have to give up listening at all and can listen continuously. The diagram below.

The performance gain seems obvious: contention for the listening mutex drops (avoiding starvation), listening becomes more efficient, and load is balanced across listener threads, reducing their pressure. In reality, though, each listener thread's listening consumes CPU. With only one CPU core, reuse brings no advantage and can even reduce performance because of the extra listener-thread switching. So to use port reuse, consider whether each listener process/thread can be isolated on its own CPU core: whether to reuse, and how many times to reuse, should depend on the number of CPU cores and on whether the processes are bound to CPUs.

So that’s it for now.