What is learned from books always feels shallow; to truly understand something, you must practice it yourself

In Web development, most of us have had to implement attachment downloads, and the garbled Chinese file names that browsers show after downloading have probably frustrated you at some point.

If you search online, most answers detect the browser type from the User-Agent field in the request headers and handle each browser differently, with code similar to the following:

// Microsoft browsers
if (agent.contains("msie") || agent.contains("trident") || agent.contains("edge")) {
    // special handling of filename
}
// Firefox
else if (agent.contains("firefox")) {
    // special handling of filename
}
// Safari
else if (agent.contains("safari")) {
    // special handling of filename
}
// Chrome
else if (agent.contains("chrome")) {
    // special handling of filename
}
// Finally, put filename into the response header
response.setHeader("Content-Disposition", "attachment; fileName=" + filename);

However, this code looks like magic. Why does each browser need different handling? Do we have to add compatibility code every time a new browser comes out? Isn't there a unified standard that constrains these browsers?

With this in mind, I looked through the RFC documentation and came up with an elegant solution:

// percentEncodedFileName is the percent-encoded file name
response.setHeader("Content-disposition",
        "attachment; filename=" + percentEncodedFileName
        + "; filename*=utf-8''" + percentEncodedFileName);

In my tests, this response header works in all the major browsers on the market. Because it is defined at the HTTP protocol level, it is independent of the programming language: as long as you set the response header according to this rule, you can solve the annoying garbled Chinese attachment name problem once and for all.

Next, I will walk you through the RFC documents to see how this response header came about.

1. Content-Disposition

It all starts with RFC 6266, which describes the Content-Disposition response header. The header was not originally part of the HTTP standard, but because it is so widely used, it is specified in this document. Its syntax is as follows:

     content-disposition = "Content-Disposition" ":"
                            disposition-type *( ";" disposition-parm )

     disposition-type    = "inline" | "attachment" | disp-ext-type
                         ; case-insensitive
     disp-ext-type       = token

     disposition-parm    = filename-parm | disp-ext-parm

     filename-parm       = "filename" "=" value
                         | "filename*" "=" ext-value

There are two main disposition-type values:

  • inline is the default handling; the content is typically displayed within the page
  • attachment means the content should be saved locally, and filename or filename* needs to be set

Note filename and filename* in disposition-parm. The document states that the information given here can be used to construct the file name to save under.

The difference between the two is that the value of filename is not encoded, whereas filename* follows the encoding rules defined in RFC 5987:

Producers MUST use either the "UTF-8" ([RFC3629]) or the "ISO-8859-1" ([ISO-8859-1]) character set.

Because filename* was defined later and many older browsers do not support it, the document specifies that when both appear in the header field, recipients that support filename* use it and ignore filename.

At this point, the skeleton of the response header is in place. The following example is excerpted from [RFC 6266]:

 Content-Disposition: attachment;
                      filename="EURO rates";
                      filename*=utf-8''%e2%82%ac%20rates

Here, filename*=utf-8''%e2%82%ac%20rates has three parts separated by single quotes: the first part is the character set (utf-8), the middle part is the language (left empty), and the last part, %e2%82%ac%20rates, is the actual value. This structure is described in detail in RFC 2231, Section 4:

 A single quote is used to
   separate the character set, language, and actual value information in
   the parameter value string, and an percent sign is used to flag
   octets encoded in hexadecimal.
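
To make the three-part structure concrete, here is a minimal Java sketch that splits an extended value on the single quotes and percent-decodes the last part. The class and method names (ExtValueDemo, decode) are hypothetical and chosen only for illustration; java.net.URLDecoder is a form decoder rather than an RFC 5987 parser, so this is a rough approximation, not a complete implementation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, for illustration only.
class ExtValueDemo {

    /** Splits charset'language'value and percent-decodes the value part. */
    static String decode(String extValue) {
        // Limit 3 keeps the (possibly empty) language part and the whole value.
        String[] parts = extValue.split("'", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected charset'language'value");
        }
        String charset = parts[0];
        String encoded = parts[2];
        try {
            // URLDecoder turns '+' into a space (form decoding), so a literal
            // '+' is protected first by rewriting it as %2B.
            return URLDecoder.decode(encoded.replace("+", "%2B"), charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // The example value from RFC 6266: euro sign, space, "rates".
        System.out.println(decode("utf-8''%e2%82%ac%20rates"));
    }
}
```

The three octets %e2 %82 %ac are the UTF-8 encoding of the euro sign, so the decoded name is the euro sign followed by " rates".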

2. PercentEncode

PercentEncode is also called percent-encoding or URL encoding.

As mentioned earlier, filename* follows the encoding rules defined in [RFC 5987]. Section 3.2 of [RFC 5987] defines the character sets that must be supported:

recipients implementing this specification
MUST support the character sets "ISO-8859-1" and "UTF-8".

[RFC 5987] Section 3.2.1 further states that percent-encoding follows the definition in RFC 3986, Section 2.1, excerpted below:

A percent-encoding mechanism is used to represent a data octet in a
component when that octet's corresponding character is outside the
allowed set or is being used as a delimiter of, or within, the
component.  A percent-encoded octet is encoded as a character
triplet, consisting of the percent character "%" followed by the two
hexadecimal digits representing that octet's numeric value.  For
example, "%20" is the percent-encoding for the binary octet
"00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
character (SP).  Section 2.4 describes when percent-encoding and
decoding is applied.

Note that [RFC 3986] explicitly states that a space is percent-encoded as %20.

However, another document, RFC 1866, Section 8.2.1 (The form-urlencoded Media Type), stipulates:
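
As a worked example of the rule quoted above, the following Java sketch percent-encodes a string byte by byte over its UTF-8 octets. The class name PercentEncodeSketch is hypothetical. As a simplifying assumption, it leaves only the unreserved characters of RFC 3986, Section 2.3 unencoded; the attr-char set in RFC 5987 allows a few more characters, so this is stricter than necessary but still valid.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class PercentEncodeSketch {

    // Unreserved characters from RFC 3986, Section 2.3: these are left as-is.
    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    static String percentEncode(String input) {
        StringBuilder out = new StringBuilder();
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF; // treat the byte as an unsigned value
            if (octet < 0x80 && UNRESERVED.indexOf((char) octet) >= 0) {
                out.append((char) octet);
            } else {
                // "%" followed by the two hexadecimal digits of the octet
                out.append('%')
                   .append(Character.toUpperCase(Character.forDigit(octet >> 4, 16)))
                   .append(Character.toUpperCase(Character.forDigit(octet & 0x0F, 16)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Euro sign, space, "rates" -> prints %E2%82%AC%20rates
        System.out.println(percentEncode("\u20ac rates"));
    }
}
```

The space becomes %20 here simply because 0x20 is its octet value; no special case is involved.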

The default encoding for all forms is `application/x-www-form-urlencoded'. A form data set is represented in this media type as follows: 1. The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped as per [URL]

In other words, in application/x-www-form-urlencoded messages, spaces are replaced with +, and other characters are escaped as defined in [URL]. Here [URL] refers to RFC 1738, and the most recent revision among the URL-related documents is [RFC 3986].

This is why many documents say that percent-encoding a space yields either + or %20, for example:

w3schools: URL encoding normally replaces a space with a plus (+) sign or with %20.

MDN: Depending on the context, the character ' ' is translated to a '+' (like in the percent-encoding version used in an application/x-www-form-urlencoded message), or in '%20' like on URLs.

So the question is: how should we percent-encode spaces in practice?

I suggest following the latest documents. The rule defined in [RFC 1866] applies only to the application/x-www-form-urlencoded type; for the definition of percent-encoding itself, we should follow [RFC 3986]. Therefore, wherever percent-encoding is required, a space should be encoded as %20. There is also an answer on Stack Overflow that supports this view: When to encode space to plus (+) or %20? [7]
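
The difference between the two conventions is easy to see with java.net.URLEncoder, which implements the form-urlencoded rules. The sketch below uses a hypothetical class name, SpaceEncodingDemo. It shows that a literal + in the input is itself encoded as %2B, so any + left in the output can only have come from a space and can safely be rewritten to %20.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class, for illustration only.
class SpaceEncodingDemo {

    /** application/x-www-form-urlencoded style: space becomes '+'. */
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    /** RFC 3986 style: space becomes %20. */
    static String percentEncode(String s) {
        // Every '+' in the form-encoded output stands for a space,
        // because a literal '+' has already been encoded as %2B.
        return formEncode(s).replace("+", "%20");
    }

    public static void main(String[] args) {
        System.out.println(formEncode("a b+c"));    // a+b%2Bc
        System.out.println(percentEncode("a b+c")); // a%20b%2Bc
    }
}
```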

3. Code practice

With the theory in place, the code follows naturally. Here it is:

@GetMapping("/downloadFile")
public String download(String serverFileName, HttpServletRequest request,
                       HttpServletResponse response) throws IOException {
    request.setCharacterEncoding("utf-8");
    response.setContentType("application/octet-stream");

    String clientFileName = fileService.getClientFileName(serverFileName);
    // Percent-encode the file name; URLEncoder emits '+' for a space, so fix that up
    String percentEncodedFileName = URLEncoder.encode(clientFileName, "UTF-8")
            .replaceAll("\\+", "%20");

    StringBuilder contentDispositionValue = new StringBuilder();
    contentDispositionValue.append("attachment; filename=")
            .append(percentEncodedFileName)
            .append(";")
            .append("filename*=")
            .append("utf-8''")
            .append(percentEncodedFileName);
    response.setHeader("Content-disposition", contentDispositionValue.toString());

    // Write the file stream to the response
    try (InputStream inputStream = fileService.getInputStream(serverFileName);
         OutputStream outputStream = response.getOutputStream()) {
        IOUtils.copy(inputStream, outputStream);
    }
    return "OK!";
}

The code is simple, and there are two things that need to be explained:

  1. Why is .replaceAll("\\+", "%20") needed after URLEncoder.encode(clientFileName, "UTF-8")?

    As established earlier, wherever percent-encoding is required, the space character should be encoded as %20. The documentation of the URLEncoder class, however, explicitly states that it converts the space character to +:

    The space character " " is converted into a plus sign "+".

    The JDK is not to blame for this, because its documentation states that it follows the application/x-www-form-urlencoded format (PHP has a similar function that behaves the same way):

    Translates a string into application/x-www-form-urlencoded format using a specific encoding scheme.

    So here we use .replaceAll("\\+", "%20") to convert the + so that the result conforms to the percent-encoding specification in [RFC 3986]. Every step is spelled out here for the sake of illustration; of course, you can also implement your own PercentEncoder class.

  2. In the [RFC 6266] standard, the value of filename= does not need to be encoded. Why is the value of filename= percent-encoded here?

    Recall from [RFC 6266] that when filename and filename* appear together, the latter is used; browsers too old to support the new standard use the former.

    Mainstream browsers now update themselves automatically, so most of them support the new standard. The exception is older versions of IE, which percent-decode the value of filename before using it. That is why the value of filename= is percent-encoded here: for compatibility with older versions of IE.

    PS: IE11 and Edge already support the new standard.

4. Browser testing

Based on StatCounter's statistics on browser market share in China in 2019, I tested with a file name that contains Chinese characters, English letters, and a space: 下载-down test.txt

Test results:

Browser                 Version           Pass
Chrome                  84.0.4147.125     true
UC                      V6.2.4098.3       true
Safari                  13.1.2            true
QQ Browser              10.6.1 (4208)     true
IE                      7-11              true
Firefox                 79.0              true
Edge                    44.18362.449.0    true
360 Secure Browser 12   12.2.1.362.0      true
Edge (Chromium)         84.0.522.59       true

The test results show that this approach works in essentially all the major browsers on the market.

5. Summary

Looking back, this article addresses a single browser compatibility problem: garbled attachment names. To solve it, I consulted two kinds of standards documents:

  1. Standards related to HTTP response headers

    [RFC 6266], [RFC 1866]

  2. Encoding standards

    [RFC 5987], [RFC 2231], [RFC 3986], [RFC 1738]

Starting from [RFC 6266], this article cites six RFC documents in total, all with their sources indicated. If you are interested, you can follow the article's line of reasoning and read the original documents; I believe you will gain a deeper understanding of the issue. The code has been uploaded to GitHub.

Finally, I can't help but marvel: a specification is really a good thing. It is just like an interface in Java: it only sets the standard and leaves the concrete implementation to everyone else.


6. References

[1]RFC 6266: https://tools.ietf.org/html/r…

[2]RFC 5987: https://tools.ietf.org/html/r…

[3]RFC 2231: https://tools.ietf.org/html/r…

[4]RFC 3986: https://tools.ietf.org/html/r…

[5]RFC 1866: https://tools.ietf.org/html/r…

[6]RFC 1738: https://tools.ietf.org/html/r…

[7]When to encode space to plus (+) or %20? : https://stackoverflow.com/que…


August 19, 2020 Update: This article has been merged into the open source framework Zoe (Pull Request: 196)

