By Zeng Ke (alias: Yisi)

Senior engineer at Ant Group

Responsible for building Ant Group's access layer; his main focus is the design and optimization of high-performance secure network protocols.

Reading time: about 18 minutes (10,279 words)

PART. 1 Introduction

As the first article in this series, this introduction is a little long-winded, so as to give you a feel for the series as a whole.

First, a little background on how this series came about. Words like QUIC and HTTP/3 seem familiar to everyone, and most developers already have some background knowledge, e.g. that the core of HTTP/3 is relying on QUIC to implement the transport and TLS layers. But when it comes to the details of HTTP/3, very little is actually known: most articles only scratch the surface of some of HTTP/3's mechanisms and features, with very little in-depth analysis, especially of the reasons and design ideas behind those mechanisms.

From my limited experience reading RFCs and writing drafts: much like academic papers, an RFC refers to related protocols by direct reference rather than restating them, both to keep the document concise and accurate and, of course, to keep the writing manageable. This means that learning a network protocol directly from its RFC is a steep climb: the reader reaches a key part, has to jump to another document, repeats this headache-inducing process, and by the time he returns to the original document he may have forgotten what the context was.

HTTP/3 involves the QUIC, TLS, HTTP/2 and QPACK standards, and each of these in turn has a large number of associated documents, so learning it is no easy thing.

Of course, this series is titled "Deep dive into HTTP/3" rather than "Deep dive into QUIC", because HTTP/3 is more than QUIC: it is also an organic combination of the existing HTTP protocols with QUIC. That part, too, will be analyzed in depth at length in later articles in this series.

A protocol's excellent performance depends not only on its own design, but also on a large amount of engineering practice: software and hardware optimization, architectural implementation, and special-purpose design. So this series will share not only the characteristics of HTTP/3 itself, but also how HTTP/3 has been deployed at Ant.

This last part of the introduction is also where the article formally begins.

It is often observed that when people learn something new, they prefer to reason by analogy from knowledge they already have, producing a deeper intuitive and rational understanding. For most of you, the question "Why does TCP need three handshakes and four waves?" has a flavor about as classic as it gets, so this article, the first in our series, will likewise start from QUIC's connection establishment and closure processes.

PART. 2 Connection establishment

2.1 A review of TCP

“Why does TCP need three handshakes?”

Before answering this question, we need to understand the nature of TCP. TCP is designed as a connection-oriented, reliable, byte-stream-based, full-duplex transport-layer communication protocol.

“Reliable” means that when TCP is used to transmit data, if TCP returns a success message, the data must have been successfully transmitted to the peer end. To ensure reliable data transmission, we first need an acknowledgement mechanism to confirm that the peer end has received the data, which is the familiar ACK mechanism.

"Streaming" is an abstraction for the protocol's users: the sender and receiver do not care how the underlying transmission happens; they simply send and read data as a continuous stream of bytes. This abstraction strongly depends on data arriving in order, so we need a mechanism to guarantee ordering. TCP's design is to tag the data with a sequence number, SEQ (in practice one SEQ covers a range of bytes, but the real effect is that every byte is ordered). The receiver checks the SEQ of incoming data against the peer's current SEQ that it has recorded, and thereby confirms the order of the data.

"Full-duplex" means that, at either end of the communication, sending and receiving are both reliable and stream-oriented, and receiving and sending are completely independent of each other.

As you can see, these TCP features are all realized with the SEQ and ACK fields as carriers, and every TCP interaction flow serves the features above; the three-way handshake is no exception. Let's look at the diagram of the TCP three-way handshake.

To ensure that both parties can confirm the ordering of the peer's data, each end must record the peer's current SEQ and confirm that the peer has synchronized its own SEQ. Guaranteeing this requires at least three messages; for efficiency, the actual implementation combines the server's SEQ and ACK into a single segment, which gives us the familiar three-way handshake.
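As a toy illustration (a Python sketch, not a real TCP stack; all names here are made up), the three segments can be modeled like this: each side announces its initial SEQ (ISN) and acknowledges the other's ISN+1:

```python
def three_way_handshake(client_isn, server_isn):
    """Return the three handshake segments as (name, seq, ack) tuples."""
    syn = ("SYN", client_isn, None)                    # client: here is my ISN
    syn_ack = ("SYN-ACK", server_isn, client_isn + 1)  # server: got it, here is mine
    ack = ("ACK", client_isn + 1, server_isn + 1)      # client: got yours too
    return [syn, syn_ack, ack]

segments = three_way_handshake(1000, 5000)
# after the exchange, both sides know, and have acknowledged, each other's SEQ
```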

Of course, the three-way handshake does more than synchronize SEQ; it can also be used to verify that the client is a normal client. For example, TCP may face these problems:

(1) attack requests that only send SYN but never return data, wasting socket resources;

(2) a stale connection-request segment suddenly reaching the server, with no follow-up data. How do we prevent such requests from wasting resources?

These, however, are merely problems the three-way handshake happens to mitigate, not problems it was designed for.

If we agreed that the SEQ of both client and server starts from 0 (or some fixed value everyone knows), then there would be no need to synchronize SEQ. Would the three-way handshake then be unnecessary? Could we just start sending data?

Of course the protocol designers considered this solution, so why didn't they adopt it? We'll look at the problems TCP faces in the next section.

2.2 TCP Problems

2.2.1 SEQ attacks

In the previous section we noted that TCP relies on SEQ and ACK to implement its reliable, streaming, full-duplex transmission mode, yet in reality it has to synchronize both ends' SEQ with a three-way handshake. If the initial SEQs of the communicating parties had been agreed in advance, the handshake could be avoided, so why isn't it? The answer is security.

As we know, TCP data is not protected in any way, neither header nor payload; an attacker anywhere in the world can forge a seemingly legitimate TCP packet.

A typical example: an attacker can forge a RESET segment to forcibly close a TCP connection. The key to a successful attack lies in the SEQ and ACK fields; as long as these two fields fall within the receiver's sliding window, the segment is considered valid. The TCP handshake uses a random initial SEQ (not completely random: it increases roughly linearly over time and wraps around at 2^32), making SEQ harder for an attacker to guess and increasing security.
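The "in-window" acceptance rule that makes forged RSTs feasible can be sketched as follows (a simplified model for illustration; real stacks apply the stricter checks of RFC 5961):

```python
SEQ_SPACE = 2 ** 32  # SEQ is a 32-bit field and wraps around

def rst_accepted(seq, rcv_nxt, rcv_wnd):
    """A RESET is accepted if its SEQ falls inside the receive window
    [rcv_nxt, rcv_nxt + rcv_wnd), using modular arithmetic for wrap-around."""
    return (seq - rcv_nxt) % SEQ_SPACE < rcv_wnd
```

With a 2^32 SEQ space and a window of only tens of kilobytes, a blind attacker must guess SEQ, which is exactly why randomizing the initial SEQ raises the bar.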

For this reason too, TCP has to perform the three-way handshake to synchronize the two SEQs. This has some effect against off-path attackers, but is completely ineffective against on-path attackers, who can reset connections at will, forge packets, and even tamper with user data.

So although TCP has made some effort toward security, it is by nature only a transport protocol, and security was not one of its original design considerations; in today's network environment, TCP runs into many security problems.

2.2.2 Inevitable Data Security Problems

This is where SSL/TLS/HTTPS enter the picture.

For example, a user may tolerate a failed transfer, but he certainly cannot tolerate his money being transferred to an attacker. The emergence of TLS gave users a mechanism to ensure that a man in the middle can neither read nor tamper with TCP's payload data, and TLS also provides a secure authentication system that prevents attackers from impersonating web service providers. However, the TCP header remains unprotected, and an on-/off-path attacker can in theory still close the TCP connection at any time.

2.2.3 Efficiency problems caused by security

In today's network environment, secure communication has become the most basic requirement. Those familiar with TLS know that TLS also requires handshake interactions. Although years of practice and evolution have brought the TLS protocol a large number of optimizations in design and implementation (TLS 1.3, session resumption, PSK, 0-RTT, and so on), because TLS and TCP are layered separately, establishing a secure data channel is still a fairly tedious process. Take establishing a secure channel with TLS 1.3 as an example; the detailed interaction is as follows:

As you can see, three RTTs of interaction are needed before the client can actually start sending application-layer data, which is a very large overhead. Looking at the flows, the TCP handshake and the TLS handshake appear similar enough that merging them seems possible. There was indeed some literature on the feasibility of carrying the ClientHello inside the SYN packet, but that effort fizzled out for several reasons:

  1. TLS itself is designed on top of ordered transport; fusing it into TCP would require a great deal of redesign;

  2. For security reasons, TCP's SYN packet is designed to carry no data; carrying a ClientHello would require large changes to the protocol stack, and since TCP lives in the kernel, changes and iterations are painful and hard to deploy;

  3. The new protocol would not be compatible with traditional TCP, so the chance of large-scale adoption was also low.

2.2.4 TCP Design Problems

TCP was designed in a historical context where the network was far less complex than today and bandwidth was the bottleneck of the whole network, so TCP's field design is very compact. A side effect is that the control channel and the data channel are coupled together, which causes problems in some scenarios.

Such as:

SEQ ambiguity: a sender transmits a TCP segment that gets delayed because an intermediate device is congested; before receiving an ACK, the sender retransmits it. The receiver answers with a single ACK, and the sender cannot tell whether that ACK belongs to the delayed original or to the retransmission. RTT estimation therefore becomes inaccurate, which distorts the congestion-control algorithm's behavior and reduces network efficiency.
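Classic TCP stacks work around this ambiguity with Karn's algorithm: RTT samples from any segment that was retransmitted are simply discarded. A minimal sketch (the data shapes here are hypothetical):

```python
def usable_rtt_samples(acked, retransmitted):
    """acked: list of (seq, send_time, ack_time); retransmitted: set of seq.
    Per Karn's algorithm, skip samples whose ACK could belong to either the
    original transmission or a retransmitted copy."""
    samples = []
    for seq, send_time, ack_time in acked:
        if seq in retransmitted:
            continue  # ambiguous sample: drop it
        samples.append(ack_time - send_time)
    return samples
```

The cost is fewer RTT samples exactly when the network is struggling, which is one of the inefficiencies QUIC's packet numbering removes.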

Difficult TCP keepalive: suppose the peer suddenly loses power and the connection is gone. In-flight data keeps failing to be retransmitted, and since retransmissions take priority over keepalive probes, no keepalive packet can even be sent; only after a long series of retransmission failures can we conclude that the connection is dead.

Head-of-line blocking: strictly speaking this is not a problem of TCP itself, because TCP is a connection-oriented protocol that guarantees reliable transmission on a single connection, and it does that job. But with the growth of the Internet, more and more data is transmitted over the network; if all of it goes over one TCP connection, a packet loss in one piece of data blocks the transmission of everything behind it, seriously hurting efficiency. Using multiple TCP connections to transfer data is one workaround, but multiple connections introduce new overhead and connection-management issues.

Understanding these TCP problems lets us dig into QUIC's complex mechanisms and see where its design comes from.

2.3 QUIC connection design

Like TCP, QUIC’s primary goal is to provide a reliable, ordered streaming protocol. Not only that, but QUIC also ensures native data security and efficient transmission.

It is fair to say that QUIC benchmarks itself against TCP+TLS with a more concise, more efficient set of mechanisms. And just as with TCP+TLS, the essence of QUIC's connection process is to serve the features above. Since QUIC is a protocol redesigned on top of UDP, it carries little historical baggage. Let's first sort out what we want from this new protocol:

With the requirements sorted out, let’s take a look at what QUIC does.

First, take a look at the establishment process of a QUIC connection. A rough sketch is as follows:

As you can see, compared with TCP+TLS, QUIC needs only 1.5 RTT to complete connection establishment, a great improvement in efficiency. If you are familiar with TLS, you may notice that QUIC's connection process is not very different from a TLS handshake. But TLS is a protocol that strongly depends on ordered, reliable transport, while QUIC relies on TLS to achieve ordered, reliable transport, which looks like a chicken-and-egg problem. So how does QUIC solve it?

We need a deeper look at QUIC connection establishment; the rough sketch only conveys QUIC's efficiency relative to TCP+TLS. Here is the more refined QUIC connection-establishment process:

The diagram here is a bit busy, and even setting aside the TLS handshake details (QUIC's TLS design will be covered in a later article in this series), the whole flow is really a request-response pattern just like TCP's. Compared with TCP+TLS, however, we see some differences:

1. The diagram introduces "Initial packets", "Handshake packets" and "short-header packets";

2. The diagram introduces the concepts of pkt_number and stream+offset;

3. The way the pkt_number subscripts change looks odd.

These different mechanisms are exactly where QUIC is more efficient than TCP, so let's look at each of them.

2.3.1 The design of pkt_number

The design of pkt_number has several noteworthy points:

  1. A subscript that starts from 0

As mentioned earlier, if TCP's SEQ were a field starting from 0, no handshake would be needed to synchronize it. QUIC's answer to the chicken-and-egg problem between TLS and ordered reliable transport is exactly this simple: pkt_number starts from 0, guaranteeing the ordering of the TLS data.

  2. pkt_number is encrypted for security

Of course, a pkt_number starting from 0 runs into the same security problems as TCP, and the solution is likewise simple: encrypt pkt_number. A man in the middle cannot obtain the key that protects pkt_number, so he cannot recover the real pkt_number, and cannot predict subsequent transmissions by observing it. You may wonder how this is possible, since TLS needs a handshake before a key unavailable to the middleman exists, yet pkt_number is used before the TLS handshake completes; we'll save that for the article on QUIC-TLS.

  3. Fine-grained pkt_number spaces

TLS is not, strictly speaking, a protocol whose state only moves forward: in each new state you may still receive data belonging to the previous state. That sounds abstract, so consider an example.

TLS 1.3, for instance, introduces a 0-RTT technique that lets the client send some application-layer data along with the ClientHello that initiates the TLS handshake. Naturally, we expect the application data to be processed asynchronously, without interfering with the handshake; if both were numbered in the same pkt_number space, loss of application data would inevitably affect the handshake. QUIC therefore designs three pkt_number spaces for the handshake states:

(1) Initial;

(2) Handshake;

(3) Application Data.

Corresponding to:

(1) transmission of plaintext data during the TLS handshake, i.e. the Initial packets in the figure;

(2) handshake data encrypted with the TLS handshake traffic secret, i.e. the Handshake packets in the figure;

(3) application-layer data after the handshake completes, plus 0-RTT data, i.e. the short-header packets in the figure and the 0-RTT packets not drawn there.

The three separate spaces ensure that loss detection in the three processes does not interfere. More on this in the later article on QUIC loss detection.

  4. A monotonically increasing pkt_number

Monotonically increasing here means that the plaintext pkt_number increases by 1 with every QUIC packet sent. This eliminates the retransmission ambiguity: on receiving an ACK, the sender knows exactly whether the acknowledged pkt_number refers to the original packet or to its retransmission, so RTT estimation and loss detection become much more precise. But an increasing pkt_number alone is not enough to order the data; let's look at the mechanism QUIC provides for ordering.
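Because a retransmission is assigned a fresh pkt_number, an ACK identifies exactly one transmission and no RTT sample has to be discarded. A sketch (times in milliseconds; the names are hypothetical):

```python
def rtt_sample(send_times, acked_pn, ack_time):
    """send_times maps pkt_number -> send time. The same stream data sent
    twice gets two distinct pkt_numbers, so the lookup is unambiguous."""
    return ack_time - send_times[acked_pn]

# the original copy went out as pn=5 at t=0, its retransmission as pn=6 at t=100
send_times = {5: 0, 6: 100}
```

Contrast this with Karn's algorithm in TCP, which must throw away every sample involving a retransmission.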

2.3.2 Ordered transmission based on streams

We know QUIC is built on UDP, and UDP is an unreliable, packet-oriented protocol. This is not fundamentally different from TCP sitting on the IP layer:

(1) the bottom layer is responsible only for transmitting packets as units;

(2) the upper-layer protocol implements the more critical features: reliability, ordering, security, and so on.

From the previous section we know that TCP's design leads to head-of-line blocking at the connection level, and that pkt_number alone cannot order the data, so QUIC needs a finer-grained mechanism to solve these problems:

  1. The stream: both an abstraction and a unit

The root cause of TCP's head-of-line blocking is that one connection has only one sending stream; a blocked packet on that stream stalls all other data behind it. The remedy is not complicated: abstract multiple streams over a single QUIC connection. The overall idea looks like this:

By making each stream send independently, we effectively avoid head-of-line blocking at the QUIC connection level: if one stream is blocked, data can still be sent on the others.

With the single-connection, multi-stream abstraction in place, let's look at QUIC's ordering design. QUIC actually has an even finer-grained unit beneath the stream: the frame. A frame that carries data includes an offset field indicating the data's offset within the original stream. The initial offset is 0, which, like pkt_number starting at 0, means no handshake is needed before ordered data can be sent. If you are familiar with HTTP/2 or gRPC, this offset-based design will look familiar; it is the same idea as their streaming data transmission.
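The offset-based ordering can be sketched as a small reassembler (a hypothetical helper, not an API of any real QUIC library): frames may arrive in any order, and contiguous bytes are delivered as soon as the gap before them is filled:

```python
class StreamReassembler:
    def __init__(self):
        self.next_offset = 0   # first byte not yet delivered
        self.pending = {}      # offset -> bytes, waiting for earlier gaps
        self.delivered = b""

    def on_frame(self, offset, data):
        self.pending[offset] = data
        # flush every frame that is now contiguous with the delivered data
        while self.next_offset in self.pending:
            chunk = self.pending.pop(self.next_offset)
            self.delivered += chunk
            self.next_offset += len(chunk)

r = StreamReassembler()
r.on_frame(6, b"world")   # arrives first: buffered, gap at offsets 0..5
r.on_frame(0, b"hello ")  # fills the gap; both chunks delivered in order
```

Each stream keeps its own offset space, which is exactly why loss on one stream does not block delivery on another.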

  2. The TLS handshake is also a stream

Although TLS data is not identified by a regular stream ID, it can be treated as a special stream — the initial stream upon which all other streams are built — because it too is carried in dedicated frames with an offset field. This is what guarantees the ordering of the TLS data.

  3. Frame-based control

With the frame layer of abstraction we can of course do more: besides carrying actual data, frames can carry control data. QUIC's design learned from TCP's lessons here, defining dedicated control frames for keepalive, ACK, stream operations and so on. This fully decouples data from control, while also enabling fine-grained, per-stream control, making QUIC a more refined transport-layer protocol.

At this point, the discussion of QUIC's connection process should have made the goal of QUIC's design clear, just as this article has kept emphasizing:

“No matter what the process is, it serves to realize the features of QUIC.”

Now that we have some knowledge of the features and implementation of QUIC, let’s summarize:

Having looked at part of QUIC's connection-establishment design, don't get lost in the complexity of the process; go straight to its essence. Many mechanisms are details that were pinned down only after QUIC's overall framework had been fixed, and in the RFCs these details often occupy a great deal of space and sap the reader's energy.

One example is the amplification attack on QUIC and how it is handled. The amplification attack exploits the fact that the ClientHello sent during the TLS handshake is very small while the server may respond with a great deal of data: an attacker can send a flood of ClientHellos with the source IP spoofed to a victim's address, turning the server into a multiplier of the attacker's traffic aimed at the victim. QUIC therefore requires the client's first packet to be padded to a minimum length, provides server-side address validation, and limits how much the server may send until validation completes.
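The server-side limit can be sketched as follows (the 3x factor is the anti-amplification limit specified in RFC 9000; the function shape itself is a hypothetical illustration):

```python
AMPLIFICATION_FACTOR = 3  # RFC 9000 pre-validation send limit

def may_send(bytes_received, bytes_sent, size, address_validated):
    """Before the client's address is validated, the server must not send
    more than 3x the bytes it has received from that address."""
    if address_validated:
        return True
    return bytes_sent + size <= AMPLIFICATION_FACTOR * bytes_received
```

Combined with the requirement that a client's first packet be padded to a minimum length, this caps how much traffic a spoofed address can be made to receive.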

RFC 9000 devotes a whole chapter to this mechanism, but it is essentially a patch on QUIC's existing handshake, not a handshake redesigned around the mechanism.

PART. 3 Connection closure

3.1 From TCP to graceful closure of QUIC connections

Closing a connection is a simple requirement that reduces to two goals:

1. A user can actively and gracefully close the connection, notifying the peer and releasing resources;

2. When an endpoint can no longer maintain the connection, there is a mechanism to notify the other side.

Simple as the requirement is, TCP's implementation is anything but. Let's look at the state-machine transitions for closing a TCP connection:

This process may seem complicated enough, but it brings up even more questions, like the classic interview question:

"Why do we need TIME_WAIT?"

“How can I handle the excessive number of TIME_WAIT connections?”

"What is the difference between tcp_tw_reuse and tcp_tw_recycle?"

The root cause of all these problems is the binding of the TCP connection to its stream, in other words the coupling of control signaling with the data channel.

We can't help but ask the soul-searching question: what we need is a full-duplex data-transfer mode, but do we really have to implement it at the connection level? TCP's TIME_WAIT is precisely a product of that choice.

Back to our question: if we separate streams from the connection, guarantee reliable delivery of stream-control instructions at the connection level, and give the connection itself a simple simplex closure process — once either endpoint actively closes, the whole connection is closed — wouldn't everything become simple?

This is, of course, exactly QUIC's solution. With this perspective, let's sort out the requirements for closing a QUIC connection:

Setting aside the stream-closing design (streams will be covered in the later article on stream design), at the connection level we are left with a clean state machine:

As you can see, thanks to the simplex closure mode, the whole QUIC connection-closure process has only one closing instruction, CONNECTION_CLOSE, and only two closing states, closing and draining. Let's look at an endpoint's behavior in the two states:

  • Closing state: when the user actively closes the connection, the endpoint enters the closing state; in this state, the endpoint responds to any incoming application data only with CONNECTION_CLOSE;

  • Draining state: the state entered upon receiving a CONNECTION_CLOSE; in this state the endpoint sends nothing at all in response to incoming data.

More simply, CONNECTION_CLOSE is an instruction that needs no ACK, and therefore no retransmission. At the connection level, we only need to guarantee that the old connection is eventually closed and that a new connection is not disturbed by stale close instructions; the single CONNECTION_CLOSE instruction accomplishes all of this.
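The closing/draining behavior described above can be modeled as a toy state machine (a sketch of the flow in this section, not a real implementation):

```python
class QuicConn:
    def __init__(self):
        self.state = "open"

    def close(self):
        # local endpoint actively closes the connection
        self.state = "closing"

    def on_packet(self, frame):
        """Return the frame sent in response, or None for silence."""
        if frame == "CONNECTION_CLOSE":
            self.state = "draining"    # peer closed: go quiet
            return None
        if self.state == "closing":
            return "CONNECTION_CLOSE"  # answer everything with the close
        if self.state == "draining":
            return None                # send nothing at all
        return "ACK"                   # normal operation (simplified)
```

Note how little state there is: no TIME_WAIT, no four-way wave, just one instruction and two terminal states.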

3.2 A safer way to reset

Of course, connections close in many ways. Like TCP, besides the active closure described in the previous section, QUIC also needs the ability to reset the peer's connection directly when it is unable to respond normally.

Compared with TCP, QUIC's reset of the peer connection is safer. The mechanism is called stateless reset, and it is not very complicated: after the QUIC connection is established, the two endpoints synchronize a token, and a later connection-closing reset is validated against this token to decide whether the peer is entitled to reset the connection. This fundamentally rules out the malicious TCP RESET attack described earlier. The whole process looks like this:
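Verifying such a token is straightforward; the one point worth showing in code is the constant-time comparison, which avoids leaking the token byte by byte through timing (a sketch with hypothetical helper names; in QUIC the token is 16 bytes):

```python
import hmac
import os

def new_reset_token():
    """A 16-byte token, shared with the peer over the secure channel."""
    return os.urandom(16)

def reset_is_valid(stored_token, received_token):
    # constant-time comparison: a plain == could leak matching prefixes
    return hmac.compare_digest(stored_token, received_token)
```

An attacker who never saw the token being exchanged has only a 2^-128 chance per guess, which is what makes this reset "safer" than TCP's guessable in-window RST.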

Of course, stateless reset is no silver bullet; the price of safety is a narrower range of applicability. To be secure, the token must be transmitted over a secure channel (in QUIC it is carried in NEW_CONNECTION_ID frames or the stateless_reset_token transport parameter), and the receiver must keep state recording the token, since only that state can establish the token's validity.

Stateless reset is therefore a last resort for closing a QUIC connection, and it only works when both endpoints are in a relatively normal state. It does not cover cases such as a client sending data to port 443 while the server is not listening there at all; under TCP, the kernel stack would simply answer such packets with an RST.

3.3 Timeout disconnection: an engineering trade-off

The keepalive mechanism itself involves no tricks: a timer plus probe packets is all it takes. But because QUIC separates the connection from the data streams, closing a connection is very simple, which makes keepalive easier to use. At the connection level, QUIC provides a control instruction, the PING frame, for active liveness probing.

More simply, when a connection times out, QUIC closes it with a silent close: the machine releases all of the connection's resources directly, without notifying the peer. The benefit of silent close is that resources are freed immediately, which matters greatly for QUIC, where a single connection maintains both TLS state and stream state.
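The interplay of PING keepalive and silent close boils down to two timers; a sketch with hypothetical names and units (seconds):

```python
def next_action(now, last_activity, keepalive_interval, idle_timeout):
    """Decide what the endpoint should do given how long the connection
    has been quiet. Silent close releases everything, notifying nobody."""
    quiet = now - last_activity
    if quiet >= idle_timeout:
        return "silent_close"
    if quiet >= keepalive_interval:
        return "send_ping"   # a PING frame elicits an ACK, proving liveness
    return "wait"
```

A successful PING/ACK round trip updates last_activity and pushes the idle timeout back out.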

Its drawback is that if data from the old connection arrives later, the peer can only be told to close via a stateless reset, which is more expensive than a CONNECTION_CLOSE. This part is entirely a trade-off, and the final design decision came out of extensive engineering practice and its results.

PART. 4 Protocol evolution, as seen from QUIC connection establishment and closure

From TCP to QUIC is only one step in the evolution of network-protocol technology, yet it gives us a glimpse of where the whole network is heading. Connection establishment and closure are merely our entry point into QUIC. As this article has stressed throughout, every process, whatever it is, serves the realization of QUIC's features. On that basis, we analyzed QUIC's connection establishment and closure in detail, focusing especially on where these features and design ideas came from.

Reading to the end, you can see that no modern network protocol can sidestep security; one might say that security is the foundation of everything, while efficiency is the eternal pursuit.

QUIC, moreover, started from the idea of converging layered protocols, unifying the two interactive requirements of security and reliability. This seems to remind us that future protocols need not strictly follow the OSI model: layering serves a better division of labor among components, while convergence pursues extreme performance. If TCP+TLS can converge into QUIC, then, much like the New IP technology proposed by Huawei, combining techniques such as intelligent routing might one day converge all protocols at layer 3 and above into a new secure IP protocol.

Of course, all of that is still far away. QUIC itself is a very down-to-earth protocol; during its open-source-driven standardization it absorbed a great deal of engineering experience, so it carries few idealized features and has very strong extensibility.

Afterword

Writing this article was, much like reading an RFC, a somewhat painful process: it is nearly impossible to cover all of QUIC's technologies in one article. This article therefore deliberately treats the other technologies it depends on only lightly, expecting to introduce them in dedicated articles later.

So if you want a full understanding of HTTP/3 or QUIC, please keep an eye on the follow-up articles.

Of course, this article is based on the author's own understanding and inevitably has flaws; readers who find problems are welcome to discuss them at any time.

Recommended Reading of the Week

Cloud native runtime for the next five years

Short Steps to a Thousand Miles: A review of the QUIC protocol's deployment in Ant Group

Service link isolation technology and practice based on ServiceMesh technology

Exploration and practice of Service Mesh in Industrial and Commercial Bank of China