The mystery of %20: percent-encoding and what lies behind it

Preface

You have surely come across %20 in URLs. What is it, and how is it produced? Let's crack it together.

Today we continue our series on encoding for front-end developers. Other articles in the series:

Front-end Base64 encoding knowledge in one go: exploring the origin, pursuing the truth
Five soul-searching questions about localStorage: 5M? 10M!
6 ways to express the letter A, and the encoding knowledge behind them

Later I will fill in:

UTF-16 encoding
UTF-8 encoding

With those, the basic encoding knowledge required on the front end is essentially complete.

For more front-end fundamentals:

Follow my front-end column.
Follow the public account "The application world of the cloud".
Join the discussion group via dirge-cloud.

Unicode Basics

Unicode is just a character set that assigns a number to each character; we call that number a code point.

Unicode can be stored using three encodings:

UTF-8: a variable-length encoding scheme that uses 1 to 4 bytes per character.
UTF-16: a variable-length encoding scheme that uses two bytes for code points up to 0xFFFF (65535) and four bytes for code points above that.
UTF-32: a fixed-length encoding scheme that always uses four bytes, regardless of the size of the code point.

Therefore, UTF-8 and UTF-16 are variable-length encoding schemes, while UTF-32 is a fixed-length encoding scheme.

The advantage of a fixed-length encoding scheme is simplicity; the disadvantage is that it wastes space, which is why UTF-16 and UTF-8 exist.

UTF-8 is used for network transport, and UTF-16 is the encoding JavaScript uses for strings at runtime.
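To make the size difference concrete, here is a minimal sketch that counts the bytes one character needs in each encoding. The helper name byteLengths is made up for this illustration, and it assumes the argument holds exactly one character and that TextEncoder is available (browsers and current Node.js).

```javascript
// Hypothetical helper: byte size of a single character in each Unicode encoding.
function byteLengths(ch) {
  return {
    codePoint: "0x" + ch.codePointAt(0).toString(16).toUpperCase(),
    utf8: new TextEncoder().encode(ch).length, // TextEncoder always produces UTF-8
    utf16: ch.length * 2,                      // JS strings are 16-bit code units
    utf32: 4,                                  // fixed width by definition
  };
}

for (const ch of [" ", "人", "𣑕"]) {
  console.log(JSON.stringify(ch), byteLengths(ch));
}
```

The space needs 1 byte in UTF-8, 人 needs 3, and 𣑕 needs 4, while in UTF-16 only 𣑕 grows beyond two bytes.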

How `%20` comes about

Let’s see how we can get this %20:

escape(" ")              // "%20"
encodeURI(" ")           // "%20"
encodeURIComponent(" ")  // "%20"

This is the character's value in hexadecimal form, known as a percent code; more on that later.

How is this code produced? Here is a simple function that makes it easy to understand:

function to16Format(ch) {
    // Code point of the first character, printed as at least two uppercase hex digits
    return '%' + ch.codePointAt(0).toString(16).toUpperCase().padStart(2, '0');
}

to16Format(" ")  // "%20"

Although all three functions return the same value here, few people mention that escape is based on UTF-16 while the other two are based on UTF-8.

For characters that the functions actually encode, the results are identical for code points from 0 to 0x7F. From 0x80 upward the results differ; we will explain the principle later. Here's an example:

escape(" ")        // "%20"
encodeURI(" ")     // "%20"

escape("人")       // "%u4EBA"
encodeURI("人")    // "%E4%BA%BA"

escape("𣑕")       // "%uD84D%uDC55"
encodeURI("𣑕")    // "%F0%A3%91%95"
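The point where the two families diverge can be checked directly. The sketch below compares escape and encodeURI on U+007F and on U+00E9 (é); the helper name compareEncoders is hypothetical, and escape is deprecated, so it is used here only for the comparison.

```javascript
// Hypothetical helper: encode the same character with both families.
function compareEncoders(ch) {
  return { escape: escape(ch), encodeURI: encodeURI(ch) };
}

// U+007F: still a single byte in UTF-8, so both agree.
console.log("U+007F:", compareEncoders("\u007F"));

// U+00E9: escape prints the code point itself as %XX,
// while encodeURI prints the two UTF-8 bytes.
console.log("U+00E9:", compareEncoders("\u00E9"));
```

So the range in which both produce the same output ends at 0x7F, where UTF-8 stops using a single byte.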

To summarize:

escape, encodeURI, and encodeURIComponent all encode a space " " as %20.
escape is based on UTF-16, while the other two are based on UTF-8; their output only agrees for code points up to 0x7F.

Of course, not all characters will be encoded, so let’s see what characters won’t be encoded.

Which characters will not be encoded

Speaking of %20, we have to mention three commonly used pairs of encoding functions:

escape (unescape): deprecated
encodeURI (decodeURI)
encodeURIComponent (decodeURIComponent)

Let's set aside A-Z, a-z, and 0-9, since none of these are ever encoded, and look at which other characters are left unencoded.

Function	Characters not encoded	Encoding
escape	`@ * _ + - . /`	UTF-16
encodeURI	`- _ . ! ~ * ' ( ) ; , / ? : @ & = + $ #`	UTF-8
encodeURIComponent	`- _ . ! ~ * ' ( )`	UTF-8
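The table can be reproduced by experiment rather than memorized. This sketch walks the printable ASCII range, skips letters and digits, and keeps every character that a given function returns unchanged; the helper name unencodedPunctuation is made up for this illustration.

```javascript
// Hypothetical helper: which printable ASCII punctuation does `encode` leave alone?
function unencodedPunctuation(encode) {
  let kept = "";
  for (let cp = 0x20; cp <= 0x7e; cp++) {
    const ch = String.fromCharCode(cp);
    if (/[A-Za-z0-9]/.test(ch)) continue; // letters and digits are never encoded
    if (encode(ch) === ch) kept += ch;    // unchanged means "not encoded"
  }
  return kept;
}

console.log("escape:            ", unencodedPunctuation(escape));
console.log("encodeURI:         ", unencodedPunctuation(encodeURI));
console.log("encodeURIComponent:", unencodedPunctuation(encodeURIComponent));
```

The characters come out in code point order, so the order differs from the table, but the sets are the same.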

escape encoding

In simple terms, escape returns a new string in which certain characters are replaced by hexadecimal escape sequences, so that the string is readable on all computers. The encoded result takes the form %XX or %uXXXX. When you need to encode URLs, use encodeURI or encodeURIComponent instead.

Key point: escape is based on UTF-16.

In UTF-16, a character whose code point is greater than 0xFFFF is encoded as two 16-bit code units: a high surrogate and a low surrogate. charCodeAt(0) returns the high surrogate, and charCodeAt(1) returns the low surrogate.

A character whose code point is greater than 0xFFFF is therefore escaped into two %uXXXX sequences.

Let’s look directly at the code result:

var ch = String.fromCodePoint(0x23455);  // "𣑕"
escape(ch)            // "%uD84D%uDC55" (the code point is greater than 0xFFFF)
unescape(escape(ch))  // "𣑕"

ch.charCodeAt(0).toString(16).toUpperCase();  // high surrogate
// "D84D"
ch.charCodeAt(1).toString(16).toUpperCase();  // low surrogate
// "DC55"

The conclusion: escape is consistent with charCodeAt. Both yield the UTF-16 high and low surrogates.
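Where do D84D and DC55 come from? The following sketch applies the standard UTF-16 surrogate formula by hand. The function names toSurrogatePair and escapeAstral are hypothetical; they only illustrate the arithmetic and are not how any engine is actually implemented.

```javascript
// Hypothetical helper: split a code point above 0xFFFF into its UTF-16 surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point between 0x10000 and 0x10FFFF");
  }
  const offset = codePoint - 0x10000;      // 20 significant bits remain
  const high = 0xd800 + (offset >> 10);    // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: format the pair the way escape does.
function escapeAstral(codePoint) {
  return toSurrogatePair(codePoint)
    .map((unit) => "%u" + unit.toString(16).toUpperCase())
    .join("");
}

console.log("manual:", escapeAstral(0x23455));
console.log("escape:", escape(String.fromCodePoint(0x23455)));
```

Both lines print the same two %uXXXX sequences.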

encodeURI encoding

Because URLs can only consist of standard ASCII characters, other characters must be encoded. Each one is replaced by a series of escape sequences representing its UTF-8 encoding. encodeURI and encodeURIComponent serve this purpose.

Note that encodeURI and encodeURIComponent use UTF-8 encoding.

Take a look at the code points and UTF-8 encoding, and the number of bytes required.

Unicode code point range (hexadecimal)	Decimal range	UTF-8 encoding pattern (binary)	Number of bytes
`0000 0000 ~ 0000 007F`	`0 ~ 127`	`0xxxxxxx`	1
`0000 0080 ~ 0000 07FF`	`128 ~ 2047`	`110xxxxx 10xxxxxx`	2
`0000 0800 ~ 0000 FFFF`	`2048 ~ 65535`	`1110xxxx 10xxxxxx 10xxxxxx`	3
`0001 0000 ~ 0010 FFFF`	`65536 ~ 1114111`	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`	4
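The table translates almost line by line into code. Below is a minimal sketch of a hand-written UTF-8 encoder that emits %XX sequences; the name utf8PercentEncode is hypothetical, and the sketch assumes a valid code point (it does not reject lone surrogates).

```javascript
// Hypothetical helper: encode one code point as percent-encoded UTF-8 bytes.
function utf8PercentEncode(codePoint) {
  let bytes;
  if (codePoint <= 0x7f) {
    bytes = [codePoint];                                  // 0xxxxxxx
  } else if (codePoint <= 0x7ff) {
    bytes = [0xc0 | (codePoint >> 6),                     // 110xxxxx
             0x80 | (codePoint & 0x3f)];                  // 10xxxxxx
  } else if (codePoint <= 0xffff) {
    bytes = [0xe0 | (codePoint >> 12),                    // 1110xxxx
             0x80 | ((codePoint >> 6) & 0x3f),
             0x80 | (codePoint & 0x3f)];
  } else {
    bytes = [0xf0 | (codePoint >> 18),                    // 11110xxx
             0x80 | ((codePoint >> 12) & 0x3f),
             0x80 | ((codePoint >> 6) & 0x3f),
             0x80 | (codePoint & 0x3f)];
  }
  return bytes
    .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}

console.log("U+4EBA: ", utf8PercentEncode(0x4eba));
console.log("U+23455:", utf8PercentEncode(0x23455));
```

For characters that encodeURIComponent encodes, the output matches what the built-in function returns.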

Let's look at the character 人 first.

Get its code point, 4eba:

var codePoint = "人".codePointAt(0).toString(16)  // "4eba"

It falls in the range 0000 0800 ~ 0000 FFFF, whose pattern is 1110xxxx 10xxxxxx 10xxxxxx, so it requires three bytes.
After encodeURI, you can see three %XX sequences:

encodeURI("人")  // "%E4%BA%BA"

You can verify the encoding result with the "Convert UTF8 to Binary Bits" tool from Online UTF8 Tools.

The resulting bytes are: 11100100 10111010 10111010

(0b11100100).toString(16).toUpperCase()  // E4
(0b10111010).toString(16).toUpperCase()  // BA
(0b10111010).toString(16).toUpperCase()  // BA

encodeURI("人")  // "%E4%BA%BA" => E4 BA BA

Next, work through the character 𣑕 in the same way.

Its code point is 0x23455.
It falls in the range 0001 0000 ~ 0010 FFFF, whose pattern is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, so it requires four bytes.
After encodeURI, the result consists of four %XX sequences:

encodeURI("𣑕")    // "%F0%A3%91%95"

encodeURIComponent encoding

Why do we need encodeURIComponent when we already have encodeURI?

It is used to encode the parameter values that follow the address, which we usually call the query string.

Here’s an example:

var param = "http://www.yyy.com"; // param is used as a parameter value
param = encodeURIComponent(param);
var url = "http://www.xxxx.com?target=" + param;

The same applies to the part after the ? in the address below: each key and value in the query string (here `empty key=ahhah` and `type=x`) needs to be encoded with encodeURIComponent.

http://wwww.xxxyyy.com/hahah?empty key=ahhah&type=x
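As a rough sketch of that rule, the helper below encodes every key and every value separately and only then joins them with = and &. The name buildUrl and the sample parameters are invented for this illustration.

```javascript
// Hypothetical helper: append an encoded query string to a base address.
function buildUrl(base, params) {
  const query = Object.entries(params)
    .map(([key, value]) => encodeURIComponent(key) + "=" + encodeURIComponent(value))
    .join("&");
  return query ? base + "?" + query : base;
}

const built = buildUrl("http://www.xxxx.com", {
  target: "http://www.yyy.com",
  "empty key": "a&b=c", // a value that itself contains & and =
});
console.log("url:", built);

// encodeURI would not help here: it leaves : / & = untouched.
console.log("encodeURI:", encodeURI("http://www.yyy.com"));
```

Because the delimiters inside keys and values are encoded, the receiving side can split on & and = without ambiguity.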

In fact, modern browsers percent-encode the address themselves by default. Try pasting the address above into the browser's address bar.

application/x-www-form-urlencoded

Data submitted by POST as application/x-www-form-urlencoded is also encoded.

Its encoding rules:

The data is encoded as key-value pairs delimited by '&', with '=' separating each key from its value.
Characters that are not letters or digits are percent-encoded.

Let’s take a look at percent-encoding first.

percent-encoding

Percent-encoding is a mechanism for encoding 8-bit characters that have a specific meaning in the context of a URL. It is sometimes called URL encoding. The encoding consists of a substitution: a "%" followed by the hexadecimal representation of the ASCII value of the replaced character.

It is widely used across the main set of Uniform Resource Identifiers (URIs), which includes both Uniform Resource Locators (URLs) and Uniform Resource Names (URNs). It is also used to prepare data of the application/x-www-form-urlencoded media type, which is commonly used to submit HTML form data in HTTP requests.

The characters allowed in a URI are either reserved or unreserved. Reserved characters are those that have a special meaning, such as the slash used to delimit different parts of a URL (or URI); unreserved characters have no such meaning. Percent-encoding represents reserved characters as special character sequences.

Reserved characters

Reserved characters require encoding: ':', '/', '?', '#', '[', ']', '@', '!', '$', '&', "'", '(', ')', '*', '+', ',', ';', '=', as well as '%' itself and the space ' '.

For the percent-encoding table, see: Percent-encoding | MDN
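The percent code of each of these characters is simply "%" plus its ASCII value in hexadecimal, so the table can be generated instead of looked up. The names reservedChars, percentCode and percentTable below are hypothetical.

```javascript
// The 20 characters listed above: 18 reserved characters, '%' itself, and the space.
const reservedChars = [":", "/", "?", "#", "[", "]", "@", "!", "$", "&",
                       "'", "(", ")", "*", "+", ",", ";", "=", "%", " "];

// Hypothetical helper: "%" followed by the two-digit hex ASCII value.
function percentCode(ch) {
  return "%" + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, "0");
}

const percentTable = Object.fromEntries(
  reservedChars.map((ch) => [ch, percentCode(ch)])
);
console.log(percentTable);
```

For example, ':' maps to %3A, '/' to %2F, '%' to %25, and the space to %20.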

Unreserved characters

These need no encoding and can be used as-is:

A-Z
a-z
0-9
- _ . ~

The special character " " (space):

In a URL, it is encoded as %20.
In a POST submission (application/x-www-form-urlencoded), it is replaced by +.
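The two treatments of the space can be observed side by side. This sketch uses the standard URLSearchParams class, which serializes in application/x-www-form-urlencoded form; the variable names are only for illustration.

```javascript
const text = "a b";

// In a URL component the space becomes %20.
const inUrl = encodeURIComponent(text);

// In form encoding the space becomes +.
const inForm = new URLSearchParams({ q: text }).toString();

console.log("URL component:", inUrl);
console.log("form body:    ", inForm);

// Decoding must match the encoding: only the form decoder turns + back into a space.
const formDecoded = new URLSearchParams("q=a+b").get("q");
const urlDecoded = decodeURIComponent("a+b");
console.log("form decoder:", JSON.stringify(formDecoded));
console.log("URL decoder: ", JSON.stringify(urlDecoded));
```

This is why a + received in form data must not be decoded with decodeURIComponent alone.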

Can we use encodeURIComponent alone to encode the keys and values?

The answer is no.

Percent-encoding requires 20 special characters to be encoded (including the space ' '):

: / ? # [ ] @ ! $ & ' ( ) * + , ; = % (space)

encodeURIComponent leaves nine characters unencoded:

- _ . ! ~ * ' ( )

So five characters need additional encoding: ! ' ( ) *. How do we work that out? See the following code:

var percentChars = [':', '/', '?', '#', '[', ']', '@', '!', '$', '&', "'", '(', ')', '*', '+', ',', ';', '=', '%', ' '];
var eURICChars = ['-', '_', '.', '!', '~', '*', "'", '(', ')'];

// Characters that must be percent-encoded but that encodeURIComponent leaves alone
var notInPChars = percentChars.filter(c => eURICChars.includes(c));

console.log("notInPChars:", notInPChars);
// notInPChars: (5) ['!', "'", '(', ')', '*']

So, the complete code should look like this:

function encodeValue(val)
{
   var eVal = encodeURIComponent(val);

   // Handle the characters that encodeURIComponent does not encode
   eVal = eVal.replace(/\*/g, '%2A');
   eVal = eVal.replace(/!/g, '%21');
   eVal = eVal.replace(/\(/g, '%28');
   eVal = eVal.replace(/\)/g, '%29');
   eVal = eVal.replace(/'/g, '%27');

   // Special handling of the space character
   return eVal.replace(/%20/g, '+');
}
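The five separate replacements can also be folded into one. The following is a compact variant of the same idea, not a replacement for the author's function; the name encodeFormValue is hypothetical.

```javascript
// Hypothetical compact variant: one pass for ! ' ( ) *, then %20 becomes +.
function encodeFormValue(val) {
  return encodeURIComponent(val)
    .replace(/[!'()*]/g, (c) => "%" + c.charCodeAt(0).toString(16).toUpperCase())
    .replace(/%20/g, "+");
}

console.log("encoded:", encodeFormValue("a b!(c)*'"));
console.log("plus:   ", encodeFormValue("1+1"));
```

A literal + in the input is already turned into %2B by encodeURIComponent, so the + produced for spaces stays unambiguous.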

Content-Disposition: attachment; filename

When the back end returns a file, if it sets Content-Disposition: attachment together with a filename, the client downloads the file directly on receiving the response. The catch is that the filename also needs to be encoded.

An example from MDN:

var fileName = 'my file(2).txt';
var header = "Content-Disposition: attachment; filename*=UTF-8''"
             + encodeRFC5987ValueChars(fileName);

console.log(header);
// outputs "Content-Disposition: attachment; filename*=UTF-8''my%20file%282%29.txt"


function encodeRFC5987ValueChars(str) {
    return encodeURIComponent(str).
        // Note that although RFC3986 reserves "!", RFC5987 does not,
        // so we do not need to escape it
        replace(/['()]/g, escape). // i.e., %27 %28 %29
        replace(/\*/g, '%2A').
        // The following are not required to be percent-encoded by RFC5987,
        // so we decode | ` ^ again for slightly better readability
        replace(/%(?:7C|60|5E)/g, unescape);
}

This differs slightly from plain percent-encoding, as the comments make clear. Aren't all these specifications tiring?

The comments mention specifications such as RFC3986 and RFC5987, so let's take a look at them.

RFC3986, RFC1738, RFC5987

RFC3986 and RFC1738 are URI encoding specifications; RFC5987 is a specification for HTTP header field parameters.

RFC3986: published in 2005, the current standard. It gives detailed guidance on encoding and decoding URLs, pointing out which characters must be encoded so that the semantics of the URL do not change, and explaining why.
RFC1738: published in 1994. Same subject as above.
RFC5987: Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters, the specification for encoding parameter values in HTTP header fields.

You'll find that a lot of code also handles the ~ character. Although RFC3986 states that the tilde does not need URL encoding, many older gateways and transport agents still encode it.

Some code therefore stays compatible with RFC1738, such as the formats in the well-known qs library.

window.btoa and window.atob

window.btoa encodes a string as Base64, and window.atob decodes it.

window.btoa("abcd")      // "YWJjZA=="
window.atob("YWJjZA==")  // "abcd"

But it only works on strings whose characters all fall within the Latin1 range (code points 0 to 0xFF):

window.btoa("人")
// Uncaught DOMException: Failed to execute 'btoa' on 'Window':
// The string to be encoded contains characters outside of the Latin1 range.

How do you solve it?

// ucs-2 string to base64 encoded ascii
function utoa(str) {
    return window.btoa(unescape(encodeURIComponent(str)));
}
// base64 encoded ascii to ucs-2 string
function atou(str) {
    return decodeURIComponent(escape(window.atob(str)));
}

Check it out. Perfect.

utoa("人")     // "5Lq6"
atou("5Lq6")   // "人"

So what's the idea?

encodeURIComponent converts the characters into percent-encoded UTF-8 bytes of the form %XX, and unescape then turns each %XX into a single character with a code point between 0 and 0xFF, which is what btoa requires. Therefore btoa(unescape(encodeURIComponent(str))) encodes the text as UTF-8 bytes and then encodes those bytes as Base64.
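Since escape and unescape are deprecated, the same two steps (text to UTF-8 bytes, bytes to Base64) can also be written with TextEncoder and TextDecoder. This is a minimal sketch, assuming an environment where btoa, atob, TextEncoder and TextDecoder are globals (browsers and current Node.js); the function names utf8ToBase64 and base64ToUtf8 are made up for this illustration.

```javascript
// Hypothetical helper: text -> UTF-8 bytes -> Base64.
function utf8ToBase64(str) {
  const bytes = new TextEncoder().encode(str);
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b); // one character per byte, as btoa expects
  }
  return btoa(binary);
}

// Hypothetical helper: Base64 -> UTF-8 bytes -> text.
function base64ToUtf8(b64) {
  const binary = atob(b64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

console.log(utf8ToBase64("人"));
console.log(base64ToUtf8("5Lq6"));
```

For the sample character this produces the same Base64 string as the utoa helper.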

You can also do without escape and unescape, as long as the encoding side and the decoding side are used as a pair. However, the result is then no longer standard UTF-8-to-Base64.

Using them as a pair:

window.btoa(encodeURIComponent("我是人a"))
// "JUU2JTg4JTkxJUU2JTk4JUFGJUU0JUJBJUJBYQ=="
decodeURIComponent(window.atob("JUU2JTg4JTkxJUU2JTk4JUFGJUU0JUJBJUJBYQ=="))
// "我是人a"

A standard Base64 decoder will not recover the original text from this result; it yields the percent-encoded string instead.

Conclusion

%20 is the result of applying escape or URL encoding to the space character " ". It is also a percent code.
escape converts a string into hexadecimal escape sequences so that it is readable on all computers. It is deprecated and should not be used in new code.
encodeURI performs URL encoding on a whole URL and does not encode the characters that delimit the parameter part.
encodeURIComponent is the main URL encoding function, used for:
- The parameter part of the URL
- POST data of type application/x-www-form-urlencoded
- The attachment filename
RFC3986 and RFC1738 are URL encoding specifications.
RFC5987 is a specification for encoding HTTP header field parameters.
window.btoa and window.atob can only handle characters in the Latin1 range by default, but combined with encodeURIComponent and unescape (and their inverses) they can handle arbitrary characters.

One last question: what is the relationship between percent code and escape, encodeURI, and encodeURIComponent?

Final words

Stay true to your original intention and keep learning without tiring. If you found this article useful, your likes and comments are my biggest motivation to keep going.

You are welcome to join the technical exchange group, or add my WeChat, Dirge-Cloud, and learn together.

References

escape(string)
encodeURI
encodeURIComponent
What is the difference between escape, encodeURI, and encodeURIComponent
Percent-encoding | MDN
Percent-encoding | Wikipedia
Percent-encoding | Chinese Wikipedia
Converting to Base64 encoding in JavaScript without Deprecated 'Escape' call