When working with Chinese and other Unicode characters in JavaScript, we rely on Unicode-specific APIs.

In the early days, JavaScript provided String.prototype.charCodeAt and String.fromCharCode, which convert a string into its UTF-16 code units and UTF-16 code units back into a string.

For example:

const str = '中文';
console.log([...str].map(char => char.charCodeAt(0))); // [20013, 25991]

Here we spread the string into individual characters, then use the charCodeAt method to get each character's code. 20013 and 25991 are the corresponding codes of the two characters '中文'.

Similarly, we can use String.fromCharCode to convert these codes back into a string:

const charCodes = [20013, 25991];
console.log(String.fromCharCode(...charCodes)); // 中文

Most readers are probably familiar with these two methods, which have been supported since ES3. However, they are not enough when dealing with the full range of Unicode characters today.

Why is that? Let’s look at an example:

const str = '🀄';
console.log(str.charCodeAt(0)); // 55356

This character is the red dragon tile from mahjong that we all know, and many input methods can type it directly. It looks perfectly normal, so what's the problem?

But try again:

console.log(String.fromCharCode(55356)); // �

In fact, 55356 alone is not the UTF-16 encoding of the Unicode character 🀄. If you use charCodeAt to read its UTF-16 code units, you actually get two values:

const str = '🀄';
console.log(str.charCodeAt(0), str.charCodeAt(1)); // 55356 56324

Accordingly, String.fromCharCode(55356, 56324) restores the 🀄 character.
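
A quick check of this (output verified against the code units above):

console.log(String.fromCharCode(55356, 56324)); // 🀄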

In addition, there are some other differences, such as:

console.log('🀄'.length); // 2 — the string length is 2
'🀄'.split('');           // ["�", "�"] — splits into two characters
/^.$/.test('🀄');         // false — does not match a single "."

👉🏻 Knowledge: In the Unicode standard, characters are divided into 17 planes based on their code points. Code points range from 0x000000 to 0x10FFFF, which fits in 3 bytes (21 bits).

The highest byte is the plane number, ranging from 0x0 to 0x10, giving 17 planes in total.

Plane 0 is known as the BMP (Basic Multilingual Plane); every code point in this plane can be represented by a single 16-bit code unit, so UTF-16 encodes them directly.

The other planes are called supplementary planes. The characters in these planes are called supplementary characters and their code points exceed the range of 16 bits.
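
Since each plane spans 0x10000 code points, the plane number is simply the code point divided by 0x10000. A small arithmetic check of our own, using 0x4E2D and 0x1F004, the code points of '中' and '🀄' that we confirm later in this article:

console.log(Math.floor(0x4E2D / 0x10000));  // 0 — '中' lives in the BMP
console.log(Math.floor(0x1F004 / 0x10000)); // 1 — '🀄' lives in a supplementary plane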

In ES5 and earlier, JavaScript's Unicode-related APIs could only handle BMP characters: all string operations were based on 16-bit code units.

Therefore, when supplementary characters such as 🀄 appear, the results are not what we expect.

Since ES2015, JavaScript provides new APIs that work with Unicode code points, so we can write:

const str = '🀄';
console.log(str.codePointAt(0)); // 126980

👉🏻 Knowledge: The String.prototype.codePointAt(index) method returns the Unicode code point of the character at the specified index in a string. Unlike the older charCodeAt method, it properly supports supplementary characters.
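
One caveat worth noting (a demonstration of our own, not part of the knowledge point above): codePointAt still indexes by 16-bit code units, so an index that lands on the second half of a surrogate pair returns the lone low surrogate:

const str = '🀄';
console.log(str.codePointAt(0)); // 126980
console.log(str.codePointAt(1)); // 56324 — the low surrogate by itself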

Correspondingly, the String.fromCodePoint method converts code points back into the corresponding characters:

console.log(String.fromCodePoint(126980)); // 🀄

Unicode escape

JavaScript strings support Unicode escapes, so we can represent a character by the hexadecimal form of its code prefixed with \u, for example:

console.log('\u4e2d\u6587'); // 中文

0x4E2D and 0x6587 are the hexadecimal representations of 20013 and 25991, respectively.
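
A quick way to verify this (a check of our own):

console.log((20013).toString(16), (25991).toString(16)); // 4e2d 6587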

Note that Unicode escapes can be used not only in strings but also in identifiers. For example, we can write:

const \u4e2d\u6587 = 'test';
console.log(中文); // test

In the above code, we define a variable named 中文: the declaration writes its name with Unicode escapes, while the console.log call references it with the literal characters.

This \u plus four hexadecimal digits notation also applies only to BMP characters, so if we try to use it to escape a supplementary character, it won't work:

console.log('\u1f004'); // ὆4

Here the engine parses '\u1f004' as the character \u1f00 followed by the digit 4. We need to wrap the code point in curly braces instead, like this:

console.log('\u{1f004}'); // 🀄

Surrogate Pair

To represent supplementary characters within 16-bit code units, Unicode introduces surrogate pairs, in which two 16-bit code units together represent one code point. The rules are as follows:

  • Characters in BMP are still represented by two bytes according to UTF-16 encoding rules.
  • Supplementary characters use a pair of 16-bit code units to represent one character, as follows:
    • First subtract 0x10000 from its code point
    • Then write the result as the 20-bit binary form yyyyyyyyyy xxxxxxxxxx
    • Then encode it as 110110yyyyyyyyyy 110111xxxxxxxxxx, two 16-bit units (four bytes) in total.

110110yyyyyyyyyy and 110111xxxxxxxxxx are the two surrogates that form a surrogate pair. The first (high) surrogate ranges from U+D800 to U+DBFF, and the second (low) surrogate ranges from U+DC00 to U+DFFF.
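
We can verify these rules against 🀄 (U+1F004) with a little arithmetic of our own; the results match the charCodeAt output we saw earlier:

const codePoint = 0x1F004;
const offset = codePoint - 0x10000;    // 0xF004
const high = 0xD800 + (offset >> 10);  // top 10 bits -> 0xD83C
const low = 0xDC00 + (offset & 0x3FF); // low 10 bits -> 0xDC04
console.log(high, low); // 55356 56324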

Implement getCodePoint

Now that we understand surrogate pairs, we can implement getCodePoint with charCodeAt:

function getCodePoint(str, idx = 0) {
  const code = str.charCodeAt(idx);
  // A high surrogate means the character continues in the next code unit
  if(code >= 0xD800 && code <= 0xDBFF) {
    const high = code;
    const low = str.charCodeAt(idx + 1);
    return ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
  }
  return code;
}

console.log(getCodePoint('中'));  // 20013
console.log(getCodePoint('🀄')); // 126980

Similarly, we can implement fromCodePoint using fromCharCode:

function fromCodePoint(...codePoints) {
  let str = '';
  for(let i = 0; i < codePoints.length; i++) {
    let codePoint = codePoints[i];
    if(codePoint <= 0xFFFF) {
      // BMP characters map to a single code unit
      str += String.fromCharCode(codePoint);
    } else {
      // Supplementary characters become a surrogate pair
      codePoint -= 0x10000;
      const high = (codePoint >> 10) + 0xD800;
      const low = (codePoint % 0x400) + 0xDC00;
      str += String.fromCharCode(high) + String.fromCharCode(low);
    }
  }
  return str;
}

console.log(fromCodePoint(126980, 20013)); // 🀄中

We can use this approach to polyfill both methods in older browsers. In fact, the MDN documentation for codePointAt and fromCodePoint provides polyfills along these same lines.

getCodePointCount

A JavaScript string's length property only counts UTF-16 code units, so as we saw earlier:

console.log('🀄'.length); // 2

There are several ways to count Unicode code points instead. One is to use the spread operator, which splits a string by code point, to convert it into an array:

function getCodePointCount(str) {
  return [...str].length;
}

console.log(getCodePointCount('👫')); // 1

Or use a regular expression with the u flag:

function getCodePointCount(str) {
  let result = str.match(/./gu);
  return result ? result.length : 0;
}

console.log(getCodePointCount('👫')); // 1
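
Another equivalent option (a variation of our own) is a for...of loop, since ES2015 string iteration also steps through code points rather than code units:

function getCodePointCount(str) {
  let count = 0;
  for(const _ of str) count++; // each iteration consumes one code point
  return count;
}

console.log(getCodePointCount('🀄中')); // 2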

Extension

UTF-16 encodes supplementary characters with a fixed four bytes, while UTF-8 encodes Unicode characters with a variable one to six bytes (in its original design; modern Unicode needs at most four).

The UTF-8 encoding rules are as follows:

Bytes   Start        End           Byte 1     Byte 2     Byte 3     Byte 4     Byte 5     Byte 6
1       U+0000       U+007F        0xxxxxxx
2       U+0080       U+07FF        110xxxxx   10xxxxxx
3       U+0800       U+FFFF        1110xxxx   10xxxxxx   10xxxxxx
4       U+10000      U+1FFFFF      11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
5       U+200000     U+3FFFFFF     111110xx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
6       U+4000000    U+7FFFFFFF    1111110x   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
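
For example, '中' (U+4E2D) falls in the U+0800–U+FFFF row, so it takes three bytes. Here is a sketch of our own packing its bits according to the table:

const codePoint = 0x4E2D;                       // 0100 1110 0010 1101
const byte1 = 0xE0 | (codePoint >> 12);         // 1110 0100 -> 0xE4
const byte2 = 0x80 | ((codePoint >> 6) & 0x3F); // 1011 1000 -> 0xB8
const byte3 = 0x80 | (codePoint & 0x3F);        // 1010 1101 -> 0xAD
console.log([byte1, byte2, byte3].map(b => b.toString(16))); // ['e4', 'b8', 'ad']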

The default encoding for encodeURIComponent and Node's Buffer is UTF-8:

console.log(encodeURIComponent('中')); // %E4%B8%AD

const buffer = Buffer.from('中');
console.log(buffer); // <Buffer e4 b8 ad>

Here E4, B8 and AD are the three bytes in hexadecimal. Let's convert them back:

const byte1 = parseInt('E4', 16); // 228
const byte2 = parseInt('B8', 16); // 184
const byte3 = parseInt('AD', 16); // 173

const codePoint = (byte1 & 0xf) << 12 | (byte2 & 0x3f) << 6 | (byte3 & 0x3f);
console.log(codePoint); // 20013

We strip the control prefixes of the three bytes (1110, 10 and 10, respectively), then concatenate the remaining bits from highest to lowest, which gives exactly 20013, the code point of '中'.

So we can also use the UTF-8 encoding rules to write another generic version of getCodePoint:

function getCodePoint(char) {
  const code = char.charCodeAt(0);
  if(code <= 0x7f) return code;

  // encodeURIComponent gives us the UTF-8 bytes as '%XX' sequences
  const bytes = encodeURIComponent(char)
    .slice(1)
    .split('%')
    .map(c => parseInt(c, 16));

  let ret = 0;
  const len = bytes.length;
  // Payload bits kept from the leading byte: 0x1f for 2 bytes, 0xf for 3, 0x7 for 4
  const leadingMask = {2: 0x1f, 3: 0xf, 4: 0x7}[len];
  for(let i = 0; i < len; i++) {
    if(i === 0) {
      ret |= (bytes[i] & leadingMask) << 6 * (len - i - 1);
    } else {
      ret |= (bytes[i] & 0x3f) << 6 * (len - i - 1);
    }
  }
  return ret;
}

console.log(getCodePoint('中'));  // 20013
console.log(getCodePoint('🀄')); // 126980

In the same way, we can implement fromCodePoint:

function fromCodePoint(point) {
  if(point <= 0xffff) return String.fromCharCode(point);

  // Branch on the original code point; `point` itself gets shifted as we go
  const origin = point;
  const bytes = [];
  bytes.unshift(point & 0x3f | 0x80);
  point >>>= 6;
  bytes.unshift(point & 0x3f | 0x80);
  point >>>= 6;
  bytes.unshift(point & 0x3f | 0x80);
  point >>>= 6;
  if(origin <= 0x1FFFFF) {
    // 4-byte sequence: 11110xxx
    bytes.unshift(point & 0x7 | 0xf0);
  } else if(origin <= 0x3FFFFFF) {
    // 5-byte sequence: 111110xx (historical; modern engines reject it in decodeURIComponent)
    bytes.unshift(point & 0x3f | 0x80);
    point >>>= 6;
    bytes.unshift(point & 0x3 | 0xf8);
  } else {
    // 6-byte sequence: 1111110x (historical; modern engines reject it in decodeURIComponent)
    bytes.unshift(point & 0x3f | 0x80);
    point >>>= 6;
    bytes.unshift(point & 0x3f | 0x80);
    point >>>= 6;
    bytes.unshift(point & 0x1 | 0xfc);
  }
  const code = '%' + bytes.map(b => b.toString(16)).join('%');
  return decodeURIComponent(code);
}

console.log(fromCodePoint(126980)); // 🀄

If there is anything else you would like to discuss about Unicode, please leave a comment in the issue.