This is day 14 of my participation in the November Writing Challenge. Event details: The Last Writing Challenge of 2021.

Preface

ECMAScript 6.0, or ES6 for short, is the new generation of the JavaScript standard. It was released in June 2015 and is officially named ECMAScript 2015. More broadly, ES6 usually refers to every version of the standard after 5.1, covering ES2015, ES2016, ES2017, ES2018, ES2019, ES2020, ES2021 and so on.

Let's look at why part of a string can show up as garbled characters.

Why must it be � �

The range U+D800 to U+DFFF has no characters assigned to it, so nothing in it can be printed.

The UTF-16 encoding rules and the result of encoding “𢂘” were described in the previous article.

var ch = "𢂘";
ch.charAt(0) // '\uD848'
ch.charAt(1) // '\uDC98'

But is it true that for every character whose code point is greater than 0xFFFF, ch.charAt(0) and ch.charAt(1) are both garbled under UTF-16?

This goes back to the encoding process of characters with code points greater than 0xFFFF:

  1. Subtract 0x10000 from the code point. The result is a 20-bit value in the range 0x00000 to 0xFFFFF; if it is shorter than 20 bits, pad it with leading zeros.
  2. Add 0xD800 to the value of the high 10 bits to get the first code unit.
  3. Add 0xDC00 to the value of the low 10 bits to get the second code unit.
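
The three steps above can be sketched in JavaScript. This is only a minimal sketch, not a reference implementation, and the function name `toSurrogatePair` is made up for illustration.

```javascript
// Hypothetical helper: encode a code point above 0xFFFF as two UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('expected a code point in 0x10000 ~ 0x10FFFF');
  }
  // Step 1: subtract 0x10000, leaving a 20-bit value (0x00000 ~ 0xFFFFF).
  const offset = codePoint - 0x10000;
  // Step 2: high 10 bits + 0xD800 gives the first code unit.
  const first = (offset >> 10) + 0xD800;
  // Step 3: low 10 bits + 0xDC00 gives the second code unit.
  const second = (offset & 0x3FF) + 0xDC00;
  return [first, second];
}

const pair = toSurrogatePair('𢂘'.codePointAt(0));
console.log(pair.map(n => '0x' + n.toString(16).toUpperCase())); // [ '0xD848', '0xDC98' ]
```

The result matches what `charAt(0)` and `charAt(1)` returned earlier.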

Unicode code points range from 0 to 0x10FFFF. Those greater than 0xFFFF range from 0x10000 to 0x10FFFF; after subtracting 0x10000, the range becomes 0x00000 to 0xFFFFF.

  0x10000 ~ 0x10FFFF
- 0x10000 ~ 0x10000      (subtract 0x10000)
---------------------
  0x00000 ~ 0xFFFFF

Convert to binary

high 10 bits  low 10 bits
0000000000    0000000000  --- 0x00000
1111111111    1111111111  --- 0xFFFFF

The high 10 bits range from 0000000000 to 1111111111, which is 0x0000 to 0x03FF in hexadecimal. The low 10 bits have the same range: 0000000000 to 1111111111, or 0x0000 to 0x03FF.
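
To make the split concrete, the 20-bit value for “𢂘” can be printed in binary and cut into two 10-bit halves. A small sketch; the variable names are only for illustration.

```javascript
// 20-bit value for "𢂘" after subtracting 0x10000
const offset20 = '𢂘'.codePointAt(0) - 0x10000;

// Pad with leading zeros to 20 bits, then cut into two 10-bit halves
const bits = offset20.toString(2).padStart(20, '0');
const high10 = bits.slice(0, 10);
const low10 = bits.slice(10);

console.log(high10, low10);                    // 0001001000 0010011000
console.log(parseInt(high10, 2).toString(16)); // 48
console.log(parseInt(low10, 2).toString(16));  // 98
```

Both halves stay within 0x0000 to 0x03FF, as expected for 10-bit values.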

The first code unit is the high 10 bits plus 0xD800, so its value range is: 0xD800 + 0x0000 ~ 0xD800 + 0x03FF = 0xD800 ~ 0xDBFF

'0x' + (0xD800 + 0x0000).toString(16).toUpperCase() // 0xD800
'0x' + (0xD800 + 0x03ff).toString(16).toUpperCase() // 0xDBFF

The second code unit is the low 10 bits plus 0xDC00, so its value range is: 0xDC00 + 0x0000 ~ 0xDC00 + 0x03FF = 0xDC00 ~ 0xDFFF

'0x' + (0xDC00 + 0x0000).toString(16).toUpperCase() // 0xDC00
'0x' + (0xDC00 + 0x03ff).toString(16).toUpperCase() // 0xDFFF

A quick summary:

The first code unit ranges from 0xD800 to 0xDBFF, and the second code unit ranges from 0xDC00 to 0xDFFF.

The Unicode surrogate area, 0xD800 to 0xDFFF, does not represent any characters.
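
Given these ranges, a single code unit can be classified with a simple comparison. A sketch with hypothetical helper names (`isFirstUnit`, `isSecondUnit`):

```javascript
// Hypothetical helpers: classify a UTF-16 code unit by the ranges above
const isFirstUnit = unit => unit >= 0xD800 && unit <= 0xDBFF;
const isSecondUnit = unit => unit >= 0xDC00 && unit <= 0xDFFF;

const sample = '𢂘';
console.log(isFirstUnit(sample.charCodeAt(0)));  // true
console.log(isSecondUnit(sample.charCodeAt(1))); // true
console.log(isFirstUnit('a'.charCodeAt(0)));     // false
```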

The values of the first and second code units are locked inside the surrogate area, so each one on its own must be garbled. So, for characters with code points greater than 0xFFFF:

  1. Its length is 2
  2. Accessing it by index gives a garbled character at both positions
  3. charAt gives garbled characters as well
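
All three points are easy to check in a console. For contrast, the ES6 additions `codePointAt` and string iteration treat the pair as a single character. A short sketch using the same “𢂘”:

```javascript
const han = '𢂘';

console.log(han.length); // 2
// Each half alone is a lone surrogate inside 0xD800 ~ 0xDFFF
console.log(han[0] === '\uD848', han.charAt(1) === '\uDC98'); // true true

// ES6: read the whole code point, or iterate by code point
console.log(han.codePointAt(0).toString(16)); // 22098
console.log(Array.from(han).length);          // 1
console.log([...han][0] === han);             // true
```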

Now you know why it must be gibberish.

0xD800 ~ 0xDFFF

I can't say for sure that this is how UTF-16 was really designed, but the idea is quite neat, so let me try to work through the derivation. How might 0xD800 and 0xDC00 have been chosen when it was first designed?

For code points greater than 0xFFFF, subtracting 0x10000 gives a value in the range 0x00000 to 0xFFFFF.

Such a character takes four bytes to store. For the calculation, see the earlier section on why it must be garbled.

The high 10 bits range from 0x0000 to 0x03FF, and the low 10 bits range from 0x0000 to 0x03FF.

0x03FF // 1023
0xDFFF - 0xD800 // 2047

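Interestingly, the surrogate area spans 0xDFFF - 0xD800 = 2047, which means it holds 2048 code points: exactly double the 1024 values (0x0000 to 0x03FF) that each 10-bit half can take.

It just so happens that the high 10 bits and the low 10 bits together need 2 × 1024 = 2048 code points, so the boundary between the two ranges should sit at the midpoint of the area:

Surrogate area midpoint: (max - min) / 2 + min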

'0x' + ((0xDFFF - 0xD800) / 2  + 0xD800).toString(16).toUpperCase() // '0xDBFF.8'
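
The same split can be checked with whole numbers instead of a fractional midpoint. A quick sketch (the variable names are illustrative): the area holds 0x800 code points, and half of that, 0x400, is exactly the number of values a 10-bit half can take.

```javascript
const areaStart = 0xD800;
const areaEnd = 0xDFFF;

const total = areaEnd - areaStart + 1; // number of code points in the area
const half = total / 2;                // values needed for each 10-bit half

console.log(total, half); // 2048 1024
console.log('0x' + (areaStart + half - 1).toString(16).toUpperCase()); // 0xDBFF, end of the first range
console.log('0x' + (areaStart + half).toString(16).toUpperCase());     // 0xDC00, start of the second range
```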

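The result falls between 0xDBFF and 0xDC00, which is exactly where the first code unit's range ends and the second's begins. That's about it. I'm really a smart kid.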

Summary

Did you learn anything today?