Author: Ji Zhi

Recently, while studying the lexical analysis in Babylon, I came across an interesting piece of code that puzzled me.

// this.input is the input string
// this.state.pos is the index of the character currently being parsed
fullCharCodeAtPos() {
  const code = this.input.charCodeAt(this.state.pos);
  if (code <= 0xd7ff || code >= 0xe000) return code;

  const next = this.input.charCodeAt(this.state.pos + 1);
  return (code << 10) + next - 0x35fdc00;
}

Judging from the method name, its purpose is to get the Unicode code point of the character at the current position. charCodeAt already returns the character's code unit, so what is the if statement after it for?

When I run into a problem like this, I check the documentation first, in this case the MDN page for charCodeAt.

The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.

So charCodeAt cannot return the whole value when a character's code point exceeds 65535. But why is the threshold 65535, and when does a code point exceed it? With those questions in mind I did some searching, and it turned out to be a deep rabbit hole.
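A quick console experiment makes the question concrete. This is only an illustrative sketch; the variable names are mine and are not taken from Babylon or MDN. It compares charCodeAt with the ES6 method codePointAt on one character below the 65535 threshold and one above it.

```javascript
// charCodeAt reads one 16-bit code unit; codePointAt reads a whole code point.
const bmp = "一";     // U+4E00, fits in a single code unit
const astral = "𠮷";  // U+20BB7, larger than 65535

const bmpUnit = bmp.charCodeAt(0);         // 0x4E00
const astralHigh = astral.charCodeAt(0);   // 0xD842, only the first half
const astralLow = astral.charCodeAt(1);    // 0xDFB7, the second half
const astralPoint = astral.codePointAt(0); // 0x20BB7, the full code point

console.log(
  bmpUnit.toString(16),
  astralHigh.toString(16),
  astralLow.toString(16),
  astralPoint.toString(16)
);
```

Neither of the two values returned by charCodeAt for "𠮷" is the code point itself, which is exactly the gap the if statement has to deal with.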

Unicode

Computers only deal with numbers, so every character they store needs a unique numeric identifier. Different countries designed different standards for this, such as ASCII, GB2312, GBK, and GB18030. Unicode was created to overcome the limitations of these traditional character sets and to cover the characters of every language in the world. It starts at 0 and maps each character to a unique number written as U+ followed by four to six hexadecimal digits.

U+4E00 = "一"

The hexadecimal number after U+ is the character's Unicode code point. Code points range from U+0000 to U+10FFFF, which gives 1114112 (17 × 65536) code points in total.
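The arithmetic behind that total can be checked in a few lines. This is a small sketch with constant names of my own choosing; String.fromCodePoint is the standard ES6 function for turning a code point number back into a string.

```javascript
// Code points run from U+0000 to U+10FFFF, grouped into planes of 65536 each.
const PLANE_SIZE = 0x10000;      // 65536 code points per plane
const MAX_CODE_POINT = 0x10ffff; // the last valid code point

const totalCodePoints = MAX_CODE_POINT + 1;      // 1114112
const planeCount = totalCodePoints / PLANE_SIZE; // 17

// Turn a code point number back into the character it identifies.
const one = String.fromCodePoint(0x4e00); // "一"

console.log(totalCodePoints, planeCount, one);
```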

UTF-16

Unicode only assigns a code point to each character; how those code points are stored as bytes in a file is the job of an encoding. There are UTF-8, UTF-16, UTF-32, and others. Let's look at UTF-16 first, because the MDN description of charCodeAt mentions UTF-16 code units.

  1. Basic definition

    Code range            Bytes  Name
    0x0000 – 0xFFFF       2      Basic Multilingual Plane
    0x10000 – 0x10FFFF    4      Supplementary planes

    UTF-16 is partly fixed-length and partly variable-length. It divides all code points into two ranges: 0x0000 - 0xFFFF is the Basic Multilingual Plane (BMP), which contains most of the characters in common use, and the rest are the supplementary planes. The largest BMP code point is 0xFFFF (2^16 - 1 = 65535), and since a byte is 8 bits, 2 bytes are enough to represent any BMP code point. The supplementary planes hold 2^20 code points (0x10000 to 0x10FFFF), so it takes 20 bits to identify one of them uniquely, and that gives rise to the concept of surrogate pairs.
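The two-range rule can be written down directly. The helper below is hypothetical (utf16ByteLength is my own name, not a built-in) and only illustrates the split described above; it treats every value up to 0xFFFF as a 2-byte unit, including the unassigned surrogate range.

```javascript
// Hypothetical helper: how many bytes UTF-16 needs for one code point.
function utf16ByteLength(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  // BMP code points fit in one 16-bit unit; everything above needs two.
  return codePoint <= 0xffff ? 2 : 4;
}

console.log(utf16ByteLength(0x00a9));  // 2
console.log(utf16ByteLength(0x20bb7)); // 4

// Cross-check: a JavaScript string's length counts 16-bit units (2 bytes each).
console.log(String.fromCodePoint(0x20bb7).length * 2); // 4
```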

  2. Surrogate pairs

    The supplementary planes contain 2^20 code points, and the range U+D800 to U+DFFF in the Basic Multilingual Plane is an empty segment, meaning these code points are not assigned to any character. UTF-16 takes advantage of this. It subtracts 0x10000 from a supplementary code point to get a 20-bit number, maps the first 10 bits into U+D800 to U+DBFF (2^10 values), called the high surrogate (H), and maps the last 10 bits into U+DC00 to U+DFFF (2^10 values), called the low surrogate (L). Together they form a surrogate pair. In other words, a supplementary-plane character is represented by two Basic Multilingual Plane code units.

  3. UTF-16 encoding formula

    If a character is in the Basic Multilingual Plane, its code point is used directly as the 16-bit code unit. For example, the copyright symbol "©" has the code point U+00A9.

    U+00A9 = 0x00A9

    For a character in the supplementary planes, such as "𠮷" with code point U+20BB7, UTF-16 gives the following formula:

    // C is the code point, corresponding to 0x20BB7 above
    H = Math.floor((C - 0x10000) / 0x400) + 0xD800
    L = (C - 0x10000) % 0x400 + 0xDC00

    This gives a high surrogate of \ud842 and a low surrogate of \udfb7, which you can confirm by printing the following in the browser console:

    console.log("\ud842\udfb7") // prints: "𠮷"
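The H and L formula above can be wrapped in a function to make it easy to try on other characters. This is a minimal sketch; toSurrogatePair is a hypothetical name of mine, and the range check is an addition that the formula itself does not state.

```javascript
// Hypothetical helper implementing the H / L formula from this section.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("only supplementary code points need a surrogate pair");
  }
  const offset = codePoint - 0x10000;                // C - 0x10000
  const high = Math.floor(offset / 0x400) + 0xd800;  // H
  const low = (offset % 0x400) + 0xdc00;             // L
  return [high, low];
}

const [high, low] = toSurrogatePair(0x20bb7);
console.log(high.toString(16), low.toString(16));      // d842 dfb7
console.log(String.fromCharCode(high, low) === "𠮷");  // true
```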

    If you print its length, you’ll be even more surprised!

    console.log("𠮷".length) // Print: 2

    If you are building a form with a textarea that limits the number of characters, this is the kind of thing that causes headaches.

    There are more surprises like this in JavaScript, which are covered below.

  4. How the high and low surrogates are split

    I am a curious person, and the passage quoted below, which appeared earlier, is something I only pieced together through research.

    The range U+D800 to U+DFFF in the Basic Multilingual Plane is an empty segment, meaning these code points are not assigned to any character. UTF-16 maps the first 10 bits of a supplementary-plane character into U+D800 to U+DBFF (2^10 values), called the high surrogate (H), and the last 10 bits into U+DC00 to U+DFFF (2^10 values), called the low surrogate (L).

    Think about it: how exactly are the first 10 bits mapped into the high range and the last 10 bits into the low range?

    There is a specific formula for this; see section 2.1 of the UTF-16 RFC:

    1) Let U' = (U - 0x10000).toString(2)
       // U' = yyyyyyyyyyxxxxxxxxxx (20 binary digits)
    2) The first ten bits go into the high surrogate, which starts at U+D800 (1101100000000000)
       H = 110110yyyyyyyyyy
    3) The last ten bits go into the low surrogate, which starts at U+DC00 (1101110000000000)
       L = 110111xxxxxxxxxx

    To test this formula, take the example of “𠮷” above.

    // Step 1: convert to binary
    let U = (0x20BB7 - 0x10000).toString(2) // "10000101110110111"

    // Step 2: pad with leading zeros to 20 bits
    U = U.padStart(20, "0") // "00010000101110110111"

    // Step 3: slice the string into two halves and add the surrogate prefixes
    let H = "110110" + U.slice(0, 10) // "110110" + "0001000010"
    let L = "110111" + U.slice(10)    // "110111" + "1110110111"

    // Step 4: convert binary to hexadecimal
    H = parseInt(H, 2).toString(16) // "d842"
    L = parseInt(L, 2).toString(16) // "dfb7"

    // The result matches the UTF-16 formula in section 3, which confirms how surrogate pairs work
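The string manipulation above is good for seeing the bits, but the same split is usually done with bit operations. The sketch below is my own restatement under that assumption (splitSurrogates is a hypothetical name); it should produce the same H and L as the steps above.

```javascript
// Hypothetical helper: the same split, using bit operations instead of strings.
function splitSurrogates(codePoint) {
  const offset = codePoint - 0x10000;    // U', at most 20 bits
  const high = 0xd800 | (offset >> 10);  // 110110 followed by the top 10 bits
  const low = 0xdc00 | (offset & 0x3ff); // 110111 followed by the bottom 10 bits
  return { high, low };
}

const pair = splitSurrogates(0x20bb7);
console.log(pair.high.toString(2)); // 1101100001000010
console.log(pair.low.toString(2));  // 1101111110110111
```

Because the low 10 bits of 0xD800 and 0xDC00 are all zero, OR-ing in a 10-bit value is the same as adding it, which is why this agrees with the arithmetic formula.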

Babylon's fullCharCodeAtPos

With surrogate pairs understood, go back to the if statement in fullCharCodeAtPos from the beginning of this article. When charCodeAt hits a supplementary-plane character, it returns only the high surrogate of the pair. Surrogates occupy the interval [0xD800, 0xDFFF], so a code unit outside this interval is a Basic Multilingual Plane character and can be returned as is. Otherwise the function reads the low surrogate as well and computes the supplementary-plane code point from the two. The constant 0x35fdc00 is (0xD800 << 10) + 0xDC00 - 0x10000, so the last line is just the inverse of the encoding formula, (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000, with the constants folded together.

// this.input is the input string
// this.state.pos is the index of the character currently being parsed
fullCharCodeAtPos() {
  const code = this.input.charCodeAt(this.state.pos);
  if (code <= 0xd7ff || code >= 0xe000) return code;

  const next = this.input.charCodeAt(this.state.pos + 1);
  return (code << 10) + next - 0x35fdc00;
}
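The number 0x35fdc00 looks arbitrary until it is rebuilt from the surrogate constants. The sketch below does that and compares the result against the built-in codePointAt. The standalone function fullCharCodeAt(input, pos) is a hypothetical adaptation of mine so the code can run outside the parser class; like the original, it does not check that the second code unit really is a low surrogate.

```javascript
// Start of the two surrogate ranges and of the supplementary planes.
const HIGH_START = 0xd800;
const LOW_START = 0xdc00;
const SUPPLEMENTARY_START = 0x10000;

// Folding the three constants into one gives the number used by Babylon.
const MAGIC = (HIGH_START << 10) + LOW_START - SUPPLEMENTARY_START;
console.log(MAGIC.toString(16)); // 35fdc00

// Hypothetical standalone version of fullCharCodeAtPos.
function fullCharCodeAt(input, pos) {
  const code = input.charCodeAt(pos);
  // Not a surrogate: the code unit is already the code point.
  if (code <= 0xd7ff || code >= 0xe000) return code;
  // Surrogate: combine it with the following code unit.
  const next = input.charCodeAt(pos + 1);
  return (code << 10) + next - MAGIC;
}

console.log(fullCharCodeAt("𠮷", 0) === "𠮷".codePointAt(0)); // true
console.log(fullCharCodeAt("a", 0)); // 97
```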

UTF-8

UTF-16 needs either two or four bytes per character. But the character "a", with code point U+0061, really only needs one byte, that is, an 8-bit binary number. UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. The first 128 characters take only 1 byte and are encoded exactly as in ASCII, which saves space.

  1. Basic definition

    Code range              Bytes
    0x0000 – 0x007F         1
    0x0080 – 0x07FF         2
    0x0800 – 0xFFFF         3
    0x010000 – 0x10FFFF     4

    In UTF-8, code points in different intervals take a different number of bytes. So when reading a file, how do we know whether the next character is made of one byte or several? This is where the UTF-8 encoding rules come in.

  2. UTF-8 encoding formula

    Unicode code point range    UTF-8 byte pattern
    0x0000 – 0x007F             0xxxxxxx
    0x0080 – 0x07FF             110xxxxx 10xxxxxx
    0x0800 – 0xFFFF             1110xxxx 10xxxxxx 10xxxxxx
    0x010000 – 0x10FFFF         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    The steps are as follows:

    1. For a single-byte character, that is, one whose Unicode code point is less than 128, the highest bit is 0 and the remaining 7 bits are the binary value of the code point.
    2. For a character that needs n bytes (n >= 2), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All the remaining bits are filled with the binary value of the code point, starting from the lowest bit and padding the high end with zeros if there are not enough digits.
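These prefix rules are also what lets a reader decide how many bytes belong to the next character: the count is visible in the first byte alone. The following is a sketch under that reading of the table; utf8SequenceLength is a hypothetical helper, not a standard API.

```javascript
// Hypothetical helper: given the first byte of a UTF-8 sequence,
// return how many bytes the whole character occupies.
function utf8SequenceLength(leadByte) {
  if ((leadByte & 0b10000000) === 0b00000000) return 1; // 0xxxxxxx
  if ((leadByte & 0b11100000) === 0b11000000) return 2; // 110xxxxx
  if ((leadByte & 0b11110000) === 0b11100000) return 3; // 1110xxxx
  if ((leadByte & 0b11111000) === 0b11110000) return 4; // 11110xxx
  // 10xxxxxx is a continuation byte and cannot start a character.
  throw new RangeError("not a valid UTF-8 lead byte: " + leadByte);
}

console.log(utf8SequenceLength(0x61)); // 1
console.log(utf8SequenceLength(0xc2)); // 2
console.log(utf8SequenceLength(0xe4)); // 3
console.log(utf8SequenceLength(0xf0)); // 4
```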

    Many people find the second rule confusing, so here is an example. The Chinese character "一" has the code point U+4E00, which is "100111000000000" in binary.

    Looking at the table above, U+4E00 falls in the third row, so it takes three bytes. Filling the binary digits into the x positions from right to left gives the following.

    1110{0}100 10111000 10000000
    // The 0 in curly braces is the zero padded at the high end
    // In hexadecimal: 0xE4B880, three bytes
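The fill-from-the-right procedure translates into shifts and masks. The encoder below is a minimal sketch of the table, written by me for illustration (encodeUtf8 and hex are hypothetical names); in real code the platform's own encoder would normally be used instead.

```javascript
// Hypothetical encoder following the UTF-8 table above; not a library API.
function encodeUtf8(codePoint) {
  if (codePoint < 0 || codePoint > 0x10ffff ||
      (codePoint >= 0xd800 && codePoint <= 0xdfff)) {
    throw new RangeError("cannot encode: " + codePoint);
  }
  if (codePoint <= 0x7f) return [codePoint];          // 0xxxxxxx
  if (codePoint <= 0x7ff) {
    return [
      0xc0 | (codePoint >> 6),                        // 110xxxxx
      0x80 | (codePoint & 0x3f),                      // 10xxxxxx
    ];
  }
  if (codePoint <= 0xffff) {
    return [
      0xe0 | (codePoint >> 12),                       // 1110xxxx
      0x80 | ((codePoint >> 6) & 0x3f),               // 10xxxxxx
      0x80 | (codePoint & 0x3f),                      // 10xxxxxx
    ];
  }
  return [
    0xf0 | (codePoint >> 18),                         // 11110xxx
    0x80 | ((codePoint >> 12) & 0x3f),                // 10xxxxxx
    0x80 | ((codePoint >> 6) & 0x3f),                 // 10xxxxxx
    0x80 | (codePoint & 0x3f),                        // 10xxxxxx
  ];
}

const hex = (bytes) => bytes.map((b) => b.toString(16)).join(" ");
console.log(hex(encodeUtf8(0x4e00)));  // e4 b8 80
console.log(hex(encodeUtf8(0x20bb7))); // f0 a0 ae b7
```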

That covers the basics of UTF-16 and UTF-8. Next, let's look at Unicode in JavaScript.

Unicode In JavaScript

JavaScript uses the Unicode character set, but which encoding does it use? An authoritative source has the answer:

JavaScript engines use UTF-16 internally, while the language itself was designed around UCS-2.

So how do we, as developers, deal with the following problems in JavaScript?

  1. Getting a character's code point

    let s = "𠮷"
    s.charCodeAt(0) // 55362, wrong! This is only the high surrogate of the pair

    // ES6
    s.codePointAt(0) // 134071, correct!

    // For older environments, version 1: the polyfill from MDN
    var codePointAt = function(position) {
      if (this == null) {
        throw TypeError();
      }
      var string = String(this);
      var size = string.length;
      // `ToInteger`
      var index = position ? Number(position) : 0;
      if (index != index) { // better `isNaN`
        index = 0;
      }
      // Account for out-of-bounds indices:
      if (index < 0 || index >= size) {
        return undefined;
      }
      // Get the first code unit
      var first = string.charCodeAt(index);
      var second;
      if ( // check if it's the start of a surrogate pair
        first >= 0xD800 && first <= 0xDBFF && // high surrogate
        size > index + 1 // there is a next code unit
      ) {
        second = string.charCodeAt(index + 1);
        if (second >= 0xDC00 && second <= 0xDFFF) { // low surrogate
          // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
          return (first - 0xD800) * 0x400 + second - 0xDC00 + 0x10000;
        }
      }
      return first;
    };

    // For older environments, version 2: adapted from Babylon
    function fullCharCodeAtPos(input) {
      const code = input.charCodeAt(0);
      if (code <= 0xd7ff || code >= 0xe000) return code;

      const next = input.charCodeAt(1);
      return (code << 10) + next - 0x35fdc00;
    }
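When every code point in a string is needed, not just the first, ES6 string iteration already does the pairing. The helper below is a small sketch of that idea (codePointsOf is a hypothetical name); it relies on the fact that for...of walks a string by code point rather than by code unit.

```javascript
// Hypothetical helper: list every code point in a string.
function codePointsOf(text) {
  const points = [];
  // for...of yields whole code points, so a surrogate pair arrives as one item.
  for (const symbol of text) {
    points.push(symbol.codePointAt(0));
  }
  return points;
}

const labels = codePointsOf("a𠮷b").map(
  (p) => "U+" + p.toString(16).toUpperCase().padStart(4, "0")
);
console.log(labels.join(" ")); // U+0061 U+20BB7 U+0062
```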
  2. String length

    "𠮷".length // 2

    // The code point of "𠮷" is in the supplementary planes,
    // so it is stored as a surrogate pair of two code units, giving length 2

    // Solution 1:
    const punycode = require('punycode');
    punycode.ucs2.decode('𠮷').length // 1

    // Solution 2:
    var regexAstralSymbols = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

    function countSymbols(string) {
      return string
        // Replace every surrogate pair with a BMP symbol.
        .replace(regexAstralSymbols, '_')
        // ... and *then* get the length.
        .length;
    }
    countSymbols('𠮷') // 1

    // ES6 solutions
    function countSymbols(string) {
      return Array.from(string).length;
    }

    countSymbols('𠮷') // 1

    function countSymbols(string) {
      return [...string].length;
    }

    countSymbols('𠮷') // 1
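The same idea applies to the character-limited textarea mentioned earlier: cutting by code unit can split a surrogate pair in half. The function below is a sketch of one way to avoid that (truncateByCodePoints is a hypothetical name). Note that it counts code points only; characters built from several code points, such as letters with combining marks, would still count as more than one.

```javascript
// Hypothetical helper for a character-limited form field.
function truncateByCodePoints(text, maxLength) {
  // Array.from splits by code point, so a surrogate pair is never cut in half.
  return Array.from(text).slice(0, maxLength).join("");
}

const input = "a𠮷b";
console.log(input.slice(0, 2) === "a\ud842"); // true: slice left a lone high surrogate
console.log(truncateByCodePoints(input, 2));  // "a𠮷"
```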
  3. Reversing a string

    "abc".split('').reverse().join('') // "cba"
    "𠮷".split('').reverse().join('') // "??"

    // Reversing "𠮷" turns "\ud842\udfb7"
    // into "\udfb7\ud842", two unpaired surrogates

    // Solution:
    const esrever = require('esrever');
    console.log(esrever.reverse('𠮷')) // "𠮷"
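If pulling in a library is not an option, spreading the string first keeps each surrogate pair together. This is only a sketch with a hypothetical function name, and it handles surrogate pairs only; text with combining marks needs more care than this.

```javascript
// Hypothetical helper: reverse by code point instead of by code unit.
function reverseByCodePoint(text) {
  // The spread operator iterates by code point, keeping surrogate pairs intact.
  return [...text].reverse().join("");
}

console.log(reverseByCodePoint("abc"));          // "cba"
console.log(reverseByCodePoint("a𠮷b"));         // "b𠮷a"
console.log(reverseByCodePoint("𠮷") === "𠮷"); // true
```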
  4. Regular expressions

    /foo.bar/.test('foo💩bar') // false

    // ES6
    /foo.bar/u.test('foo💩bar') // true

    Without the u flag, "." can only match the high surrogate "\ud83d" of the pair for 💩 ("\ud83d\udca9"), so the first test fails. ES6 added the u flag to make regular expressions work on code points.

    Without the u flag you can also use the following regex. See Regenerate for details:

    /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/.test('💩') // true
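The effect of the u flag is easiest to see by counting what "." matches. The snippet below is a short illustration using only standard regular expression features; the variable name is mine.

```javascript
const poo = "💩"; // U+1F4A9, stored as "\ud83d\udca9"

// Without u, "." matches one code unit, so the pair counts as two matches.
console.log(poo.match(/./g).length);  // 2
// With u, "." matches one code point.
console.log(poo.match(/./gu).length); // 1

// Anchored patterns show the same difference.
console.log(/^.$/.test(poo));  // false
console.log(/^.$/u.test(poo)); // true

// The u flag also enables \u{...} escapes for code points above U+FFFF.
console.log(/^\u{1F4A9}$/u.test(poo)); // true
```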

Conclusion

As the solutions above show, ES6 improved Unicode support, though only in part. If you have understood the UTF-16 encoding described above, reading Ruan Yifeng's chapter on ES6 string extensions should be no trouble. If you run into Unicode problems, especially with forms, rich text, emoji, and similar features, the writings of Mathias Bynens, a leading expert on the subject, are a good place to look for answers.

References:

zh.wikipedia.org/wiki/Unicod…

javascript-encoding

javascript-unicode

UTF-16 RFC

www.ruanyifeng.com/blog/2014/1…

unicode-table.com/en/#control…