Preface

These days I came across an interesting question about Tomcat in the Q&A area. Since I hadn't thought about it before, today I'll use it to talk about the "why" behind Tomcat's mechanism.



This article covers the HTTP file upload standard and an analysis of Tomcat's handling mechanism. The content is fairly basic; if you don't need it, skip directly to the end of the article.

File upload in the HTTP protocol

As we all know, HTTP is a text protocol. So how does a text protocol transfer files?

It transmits them directly... Yes, it's that simple. "Text protocol" only describes the application layer; at the transport layer all data is just bytes, with no difference between text and files, so no additional encoding or decoding is needed.

multipart/form-data

The HTTP protocol provides a form-based file upload method. Set the ENCTYPE attribute of the form to multipart/form-data, and add an INPUT tag whose type is file.

 <FORM ENCTYPE="multipart/form-data" ACTION="_URL_" METHOD=POST>

   File to process: <INPUT NAME="userfile1" TYPE="file">

   <INPUT TYPE="submit" VALUE="Send File">

 </FORM>

This multipart/form-data form is a little different from the default application/x-www-form-urlencoded form. Both are forms and both can carry multiple fields, but the former can upload files while the latter can only transfer text.

Now let’s look at the protocol for uploading a form file. Here is a simple multipart/form-data request:



As the request shows, the HTTP header part is the same as usual, except that a boundary parameter is added to the Content-Type, but the payload part is completely different.

The boundary is used in multipart/form-data to separate the fields of the form. In the payload, every field (part/item) is preceded by a delimiter line made of "--" plus the boundary, and the payload ends with a closing delimiter made of "--" plus the boundary plus "--".

When the server reads the request, it only needs to get the boundary from the Content-Type first, and then split the payload by that boundary to get all the fields.
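The server-side steps described above can be sketched roughly as follows. This is only a minimal illustration, not how Tomcat actually parses the message: the class name BoundarySplitSketch and both method names are made up for this example, the payload is treated as a String (real parsers work on bytes), and parts are trimmed for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class BoundarySplitSketch {

    /** Returns the boundary parameter of a Content-Type value, or null if there is none. */
    static String extractBoundary(String contentType) {
        String key = "boundary=";
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.regionMatches(true, 0, key, 0, key.length())) {
                String value = p.substring(key.length());
                // The boundary may be wrapped in double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    /** Splits a payload into raw parts; each part still holds its own headers and content. */
    static List<String> splitParts(String payload, String boundary) {
        List<String> parts = new ArrayList<>();
        for (String chunk : payload.split(Pattern.quote("--" + boundary))) {
            String part = chunk.trim();
            // Skip the empty piece before the first delimiter and the "--" of the closing delimiter
            if (!part.isEmpty() && !part.equals("--")) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String boundary = extractBoundary("multipart/form-data; boundary=XyZ");
        String payload = "--XyZ\r\n"
                + "Content-Disposition: form-data; name=\"a\"\r\n\r\n"
                + "1\r\n"
                + "--XyZ\r\n"
                + "Content-Disposition: form-data; name=\"b\"\r\n\r\n"
                + "2\r\n"
                + "--XyZ--\r\n";
        for (String part : splitParts(payload, boundary)) {
            System.out.println("[part] " + part.replace("\r\n", " | "));
        }
    }
}
```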

Each field's part starts with its own header section, which contains a Content-Disposition header. It records the name of the current field, plus a filename attribute if the field is a file; for a file, the next line is a Content-Type header that identifies the type of the file.
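To make the layout of a part concrete, here is a small sketch that assembles a multipart payload by hand. The class MultipartBodySketch, its methods, and the sample values ("title", "a.txt", "demoBoundary") are assumptions made up for illustration; only the field name userfile1 comes from the form above. Real clients generate the boundary themselves and write file content as bytes.

```java
final class MultipartBodySketch {
    private static final String CRLF = "\r\n";

    /** A plain text field: delimiter line, Content-Disposition header, empty line, value. */
    static String textPart(String boundary, String name, String value) {
        return "--" + boundary + CRLF
                + "Content-Disposition: form-data; name=\"" + name + "\"" + CRLF
                + CRLF
                + value + CRLF;
    }

    /** A file field: adds the filename attribute and a Content-Type header for the file. */
    static String filePart(String boundary, String name, String filename,
                           String contentType, String content) {
        return "--" + boundary + CRLF
                + "Content-Disposition: form-data; name=\"" + name + "\"; filename=\"" + filename + "\"" + CRLF
                + "Content-Type: " + contentType + CRLF
                + CRLF
                + content + CRLF;
    }

    /** The closing delimiter that ends the whole payload. */
    static String closingDelimiter(String boundary) {
        return "--" + boundary + "--" + CRLF;
    }

    public static void main(String[] args) {
        String boundary = "demoBoundary";
        String body = textPart(boundary, "title", "hello")
                + filePart(boundary, "userfile1", "a.txt", "text/plain", "file content")
                + closingDelimiter(boundary);
        System.out.print(body);
    }
}
```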

Although both x-www-form-urlencoded and multipart forms can transfer fields, multipart can transfer files as well as text fields. The multipart file transfer method is also a "standard", so it is supported by all kinds of servers, which can read the files directly.

x-www-form-urlencoded can only transmit basic text data. If you insist on sending a file as text, nobody can stop you from using this type, but because it is transmitted as text, the back end will inevitably parse it as a string. The byte -> string encoding overhead is completely unnecessary, and it can lead to encoding errors...

In x-www-form-urlencoded messages there is no boundary. Multiple fields are concatenated with the & symbol, and each key and value is URL-encoded.
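As a rough illustration of that format, the sketch below builds an x-www-form-urlencoded body with the JDK's URLEncoder. The class name UrlEncodedSketch and the sample fields are made up for this example, and the URLEncoder.encode(String, Charset) overload assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class UrlEncodedSketch {

    /** URL-encodes every key and value, joins them as key=value, and concatenates fields with '&'. */
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps insertion order
        fields.put("name", "Tom Cat");
        fields.put("q", "a&b=c"); // '&' and '=' inside a value must be escaped
        System.out.println(encode(fields));
    }
}
```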



Although x-www-form-urlencoded adds an extra URL-encoding step, it does not add a header section and a boundary to each field, so for text fields the message is much smaller than with multipart.

Besides multipart, there is another, less common way of uploading files directly.

Binary payload upload

In addition to multipart/form-data, there is a binary payload upload method. "Binary payload" is a name I made up myself... The HTTP protocol has no specification for this, but many HTTP clients support it.

For example, Postman:



For example, OkHttp:

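// Requires the OkHttp library (okhttp3) on the classpath
import okhttp3.*;

OkHttpClient client = new OkHttpClient().newBuilder()
  .build();
MediaType mediaType = MediaType.parse("image/png");
// The request body is the raw file content, with no multipart wrapping
RequestBody body = RequestBody.create(mediaType, "<file contents here>");
Request request = new Request.Builder()
  .url("http://localhost:8098/upload") // OkHttp rejects a URL without a scheme
  .method("POST", body)
  .addHeader("Content-Type", "image/png")
  .build();
Response response = client.newCall(request).execute();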

This approach is very simple: the file contents make up the entire payload of the request, with nothing else added.

It's simple, and the client implementation is simple too, but... server support is poor. For example, Tomcat does not treat this binary payload as a file, but as an ordinary request body.

Tomcat processing mechanism analysis

When processing a text message, Tomcat reads the Header section first and parses the Content-Length to determine the message boundary. The remaining Payload is not read all at once, but wrapped in an InputStream, which internally calls the Socket's read to fetch data from RCV_BUF (when the full message size is larger than the readBuf size).

Only when the application calls getParameter/getInputStream, or another operation that reads the content, does the InputStream read the Socket's RCV_BUF internally to fetch the Payload data.

In other words, instead of reading all the data at once and temporarily storing it in memory, Tomcat wraps the socket in an InputStream that reads RCV_BUF on demand. It is just a wrapper and does not store the data. The application layer's read on ServletRequest#getInputStream is forwarded to a read on the Socket's RCV_BUF.

However, if the application layer reads ServletRequest#getInputStream completely, converts the data into a string, and keeps it in memory, that is the application's own doing and has nothing to do with Tomcat.
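The "just a wrapper" idea can be sketched as below. This is not Tomcat's code: LazyBodySketch and ForwardingStream are hypothetical names, and a ByteArrayInputStream stands in for the socket's RCV_BUF. The point is only that nothing is consumed from the source until the application actually reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class LazyBodySketch {

    /** Holds no data of its own; every read is forwarded to the underlying source. */
    static final class ForwardingStream extends InputStream {
        private final InputStream source;

        ForwardingStream(InputStream source) {
            this.source = source;
        }

        @Override
        public int read() throws IOException {
            return source.read();
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            return source.read(buf, off, len);
        }
    }

    /** Wraps a source of sourceSize bytes, reads `wanted` bytes, and returns what is left unread. */
    static int remainingAfterRead(int sourceSize, int wanted) {
        ByteArrayInputStream source = new ByteArrayInputStream(new byte[sourceSize]); // stand-in for RCV_BUF
        ForwardingStream body = new ForwardingStream(source);
        try {
            if (wanted > 0) {
                body.read(new byte[wanted], 0, wanted);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.available();
    }

    public static void main(String[] args) {
        System.out.println("unread after wrapping only: " + remainingAfterRead(1000, 0));
        System.out.println("unread after reading 64 bytes: " + remainingAfterRead(1000, 64));
    }
}
```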

For multipart requests, Tomcat has a special handling mechanism. Since multipart is designed for transferring files, Tomcat adds the concept of a temporary file when handling this type of request, and writes the data in the multipart message to disk while parsing it.

Tomcat wraps each field as a DiskFileItem (org.apache.tomcat.util.http.fileupload.disk.DiskFileItem). DiskFileItem does not distinguish between file and text data. A DiskFileItem is divided into a Header section and a Content section. Whether the Content is kept in memory or written to disk is decided by a sizeThreshold: content no larger than the threshold stays in memory, and larger content goes to disk. However, this value defaults to 0, which means that by default the entire content section is stored on disk.
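The effect of a size threshold can be illustrated with a simplified sketch. This is not the DiskFileItem implementation: ThresholdItemSketch and its methods are hypothetical, and the only behavior it models is "stay in memory up to the threshold, otherwise move everything to a temporary file".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ThresholdItemSketch {
    private final int sizeThreshold;
    private final ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream disk; // stays null while the content still fits in memory

    ThresholdItemSketch(int sizeThreshold) {
        this.sizeThreshold = sizeThreshold;
    }

    void write(byte[] data) {
        try {
            if (disk == null && memory.size() + data.length > sizeThreshold) {
                // Threshold exceeded: move what was buffered so far to a temp file
                Path tempFile = Files.createTempFile("upload_", ".tmp");
                tempFile.toFile().deleteOnExit();
                disk = Files.newOutputStream(tempFile);
                memory.writeTo(disk);
                memory.reset();
            }
            if (disk != null) {
                disk.write(data);
            } else {
                memory.write(data);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void close() {
        try {
            if (disk != null) {
                disk.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isInMemory() {
        return disk == null;
    }

    public static void main(String[] args) {
        ThresholdItemSketch item = new ThresholdItemSketch(0); // threshold 0, as in the default described above
        item.write("hello".getBytes());
        item.close();
        System.out.println("in memory: " + item.isInMemory());
    }
}
```

With a threshold of 0, even a five-byte text field ends up in a temporary file, which is why a text-only request pays the disk cost when sent as multipart.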



Since the data is stored on disk, it must later be read back from disk... Naturally, this is relatively inefficient. So if a request only contains text, don't use multipart, because this type will be saved to disk.

As a bonus, when Tomcat handles a multipart message, if a field is not a file, it adds the field's key/value to the ParameterMap. That is to say, these non-file fields can also be obtained through request.getParameter/getParameterMap.

//org.apache.catalina.connector.Request#parseParts

if (part.getSubmittedFileName() == null) {
    String name = part.getName();
    String value = null;
    try {
        value = part.getString(charset.name());
    } catch (UnsupportedEncodingException uee) {
        // Not possible
    }
    // ......
    parameters.addParameter(name, value);
}

getParameter can only fetch form parameters (FormParam) and query parameters (QueryString), but multipart is also a form, so there is nothing wrong with getting its parameters this way.

A quick summary

Tomcat handles different types of requests:

  1. If the arguments are in the query string of a GET request, all of them are in the request head (the request line) and are read into memory all at once
  2. If it is a POST message, Tomcat only reads the Header part. It does not read the Payload part; instead the Socket is wrapped as an InputStream for the application layer to read

    1. x-www-form-urlencoded messages are not read actively, but many Web frameworks (such as SpringMVC) call getParameter, which triggers reading of the InputStream and therefore of RCV_BUF
    2. The same is true for the binary payload. Tomcat does not initiate a read; the application layer calls ServletRequest#getInputStream to read the RCV_BUF data
    3. Multipart messages are not read actively either. A call to HttpServletRequest#getParts triggers parsing/reading. Similarly, many Web frameworks call getParts, so parsing is triggered

Why write to a temporary file first, instead of just wrapping the InputStream and giving it to the application layer?

If the application layer does not read RCV_BUF, then once the received data fills RCV_BUF, the server advertises a zero receive window and the client has to stop sending. The client's data then piles up in its SND_BUF, and when SND_BUF is full too, the client's writes on the connection block.



The reasons below are my personal opinions and are not backed by official documentation. If you see it differently, please leave a comment in the comment section.

Because multipart is generally used to transfer files, and files are usually much larger than the Socket buffer, Tomcat reads the entire Payload at once and stores all of its parts to disk (the headers in memory, the contents on disk), so that the TCP connection is not blocked.

The application layer then only needs to read the Part data from Tomcat's DiskFileItem. Although this adds one layer of transfer, the data in RCV_BUF is consumed in a timely manner.

From an efficiency point of view, transferring the data and saving it to disk must be much slower than not doing so, but it lets RCV_BUF be consumed in time and ensures that the TCP connection is not blocked.

If multiple requests are multiplexed over the same TCP connection, as in HTTP/2, failing to consume RCV_BUF in time would also block all the "logical HTTP connections" on it.

So why aren't other types of messages staged on disk?

Because those messages are small. An ordinary request message is not very large, commonly a few KB to a few dozen KB, and a pure text message is read promptly and all at once. A multipart message is different: it mixes text and files, and may contain multiple files.

For example, suppose that after receiving the file the server also needs to save it to the object storage service of some cloud vendor. There are two ways to save the file:

  1. Receive the full file data, store it in memory, and then call the object storage SDK
  2. Stream it: read from ServletRequest#getInputStream while writing to the SDK's OutputStream

In mode 1, although RCV_BUF is read in time, it takes up too much memory and can easily exhaust it, which is very unreasonable. In mode 2, although memory usage is small (at most the size of one read buffer), RCV_BUF cannot be consumed in time, because both ends depend on the network: data can only be read from the client as fast as it can be written to the object store.
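Mode 2 boils down to a copy loop with one fixed-size buffer, sketched below. StreamCopySketch is a hypothetical name, and the byte-array streams in main only stand in for ServletRequest#getInputStream and the SDK's OutputStream, which this example does not actually use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class StreamCopySketch {

    /** Copies in to out using a single buffer of bufferSize bytes; returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out, int bufferSize) {
        byte[] buffer = new byte[bufferSize]; // the only memory held, regardless of file size
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                // The next read cannot start until this write returns,
                // so a slow destination slows down consumption of the source
                out.write(buffer, 0, n);
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        InputStream upload = new ByteArrayInputStream(new byte[10_000]); // stand-in for the request body
        ByteArrayOutputStream store = new ByteArrayOutputStream();       // stand-in for the SDK stream
        System.out.println("copied " + copy(upload, store, 4096) + " bytes");
    }
}
```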

Also, it is not only Tomcat: Jetty handles multipart in the same way. I haven't looked at other Web servers, but I expect they handle multipart similarly.

References

  • Apache Tomcat
  • Form-based File Upload in HTML – IETF
  • Tomcat Architecture Analysis by Liu Guangrui

Original content; unauthorized reprinting is prohibited.