Technology sharing – Unicode and JavaScript

I wonder if readers have been bothered by the following three questions.

Question 1:

During one iteration we ran into the following problem: a string containing emoji had a different length on the front end than on the back end, which broke our check of the string's maximum length. What caused the difference?

Question 2:

Why does the following expression evaluate to true?

"😀".charAt(0) === "👍".charAt(0)

The charAt() method returns the specified character from a string. – MDN

Question 3:

Can you state the exact length of each of the following strings?

'a'.length

'hi'.length

'𠮷'.length

'💩'.length

'🤦🏻‍♂️'.length


Unicode

It all comes down to Unicode. Unicode is a term that is at once familiar and unfamiliar to us.

Unicode is an industry standard in computer science. It organizes and encodes most of the world’s writing systems, making it easier for computers to present and process text. – Wikipedia

Before Unicode, English-speaking countries used ASCII, while many Asian countries had their own encodings, such as GB2312 in China. Exchanging files between countries inevitably produced garbled text. Unicode unified these rules.

Before we dive deeper into Unicode, let’s take a look at a few terms

Several terms

  • Code points:

    In Unicode, each symbol corresponds to a number written as U+ followed by hexadecimal digits, for example U+0041 for the letter A.

  • Scripts:

    A collection of letters and symbols used to represent textual information in one or more writing systems. For example, Latin, Greek, and Han are scripts, and a single script can be shared by several languages.

  • Planes:

    All code points are organized into 17 planes in Unicode. Each plane contains 65,536 ($16^{4}$) code points.

    • The first plane is the Basic Multilingual Plane (BMP), the one we know best because it holds most characters of commonly used languages, such as ASCII characters and common Chinese characters.
    • The other 16 planes are the supplementary planes (U+10000 to U+10FFFF). Strictly speaking, the name Supplementary Multilingual Plane (SMP) belongs to plane 1 alone, but this article uses SMP as shorthand for all of them. They store less common characters; plane 2, for example, contains rare and historic Chinese characters.

  • Code unit: The fixed-size binary sequence that an encoding actually stores inside a computer.

This may still be a little vague. Don't worry, we will go through it step by step.
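
To make the relation between code points and planes concrete, here is a minimal sketch. The helper name planeOf is made up for this example; it relies only on the fact that each plane holds 0x10000 code points.

```javascript
// Hypothetical helper: returns the Unicode plane (0 to 16) of the first
// code point in a string. Each plane holds 0x10000 (65,536) code points,
// so the plane number is the code point divided by 0x10000, rounded down.
function planeOf(str) {
  const codePoint = str.codePointAt(0);
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf('a'));          // 0 (BMP)
console.log(planeOf('\u4E25'));     // 0 (BMP, the Chinese character 严)
console.log(planeOf('\u{1F4A9}'));  // 1 (💩)
console.log(planeOf('\u{20BB7}'));  // 2 (𠮷)
```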

Unicode is a character set: it maps each symbol to a unique code point, but it does not dictate how a computer should store that code point.

For example, the code point of the Chinese character 严 is U+4E25. In binary that is 15 bits long (100111000100101), so representing this symbol takes at least 15 bits, that is, 2 bytes. Symbols with larger code points may need 3 or 4 bytes.

This causes several problems:

  1. The computer cannot tell how many bytes make up one symbol. Some symbols need 4 bytes while English letters need only 1, and if we told the computer to use 4 bytes for every symbol, English text would waste a great deal of space.
  2. On some old computer systems, a byte of eight consecutive zero bits is treated as the end of a string.
  3. Backward compatibility is required with machines that only understand ASCII.

Hence character encodings. A character encoding translates the code points described earlier into a unique sequence of code units. Common encodings are UTF-8 and UTF-16, and each one defines its own code unit size.

UTF-8

In UTF-8, a code unit is 8 bits.

It has two rules:

  1. For a single-byte symbol, the first bit is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. UTF-8 is therefore identical to ASCII for English letters.
  2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of each following byte are set to 10. All remaining bits hold the symbol's Unicode code point.

Unicode code point range (hex)    UTF-8 byte sequence (binary)
000000-00007F                     0xxxxxxx
000080-0007FF                     110xxxxx 10xxxxxx
000800-00FFFF                     1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF                     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For each character, the computer can quickly determine how many code units it occupies from the leading bits of its first code unit: a leading 0 means one code unit, and otherwise the number of leading 1s is the number of code units.
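
As a rough sketch of that check, the function below inspects the leading bits of a first byte and returns the length of the sequence. The name utf8SequenceLength is an assumption made for this example, not part of any library.

```javascript
// Hypothetical helper: given the first byte of a UTF-8 sequence, return how
// many bytes the whole sequence occupies, based on its leading bits.
function utf8SequenceLength(firstByte) {
  if ((firstByte & 0b10000000) === 0) return 1;          // 0xxxxxxx
  if ((firstByte & 0b11100000) === 0b11000000) return 2; // 110xxxxx
  if ((firstByte & 0b11110000) === 0b11100000) return 3; // 1110xxxx
  if ((firstByte & 0b11111000) === 0b11110000) return 4; // 11110xxx
  // 10xxxxxx is a continuation byte and can never start a sequence.
  throw new RangeError('not a valid UTF-8 leading byte');
}

console.log(utf8SequenceLength(0x61)); // 1 (the letter a)
console.log(utf8SequenceLength(0xe4)); // 3
console.log(utf8SequenceLength(0xf0)); // 4
```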

Let's go back to the character 严. Its code point is U+4E25 (100111000100101). What does it look like in UTF-8?

  1. From the table above, U+4E25 falls in the third row (000800-00FFFF), so the final binary sequence takes the form 1110xxxx 10xxxxxx 10xxxxxx.
  2. The binary digits of 4E25 are filled into the x positions from back to front, and any remaining positions are filled with zeros. The result is 11100100 10111000 10100101, or E4 B8 A5 in hexadecimal.
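
The two steps can be sketched in code. This is a minimal illustration, not a production encoder: the function name encodeUtf8 is made up here, and it assumes a valid code point (it does not reject lone surrogates or values above 0x10FFFF).

```javascript
// Hypothetical helper: encode one code point into an array of UTF-8 bytes.
// Each branch picks the row of the UTF-8 table and fills the x positions.
function encodeUtf8(codePoint) {
  if (codePoint <= 0x7f) {
    return [codePoint];                                   // 0xxxxxxx
  }
  if (codePoint <= 0x7ff) {
    return [
      0xc0 | (codePoint >> 6),                            // 110xxxxx
      0x80 | (codePoint & 0x3f),                          // 10xxxxxx
    ];
  }
  if (codePoint <= 0xffff) {
    return [
      0xe0 | (codePoint >> 12),                           // 1110xxxx
      0x80 | ((codePoint >> 6) & 0x3f),                   // 10xxxxxx
      0x80 | (codePoint & 0x3f),                          // 10xxxxxx
    ];
  }
  return [
    0xf0 | (codePoint >> 18),                             // 11110xxx
    0x80 | ((codePoint >> 12) & 0x3f),                    // 10xxxxxx
    0x80 | ((codePoint >> 6) & 0x3f),                     // 10xxxxxx
    0x80 | (codePoint & 0x3f),                            // 10xxxxxx
  ];
}

const encoded = encodeUtf8(0x4e25);
console.log(encoded.map(b => b.toString(2).padStart(8, '0')).join(' '));
// 11100100 10111000 10100101
```

In real code the built-in TextEncoder does this job; the sketch only exists to show where each bit ends up.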

UTF-16

Most JavaScript engines use UTF-16 encoding, which is the root cause of the odd string behavior we run into every day.

In UTF-16, a code unit is 16 bits.

It specifies that:

  1. A BMP code point is stored in one code unit (16 bits).
  2. A supplementary-plane (SMP) code point is stored in two code units (32 bits).

For the BMP, the encoded value is identical to the code point; it is simply written out as 16 binary digits.

The SMP is more complicated. Its code points range from 0x10000 to 0x10FFFF, and 0x10000 already needs 17 bits, so it cannot fit in a single 16-bit code unit.

The supplementary planes contain $2^{20}$ code points in total, and the basic multilingual plane reserves an empty range (U+D800 to U+DFFF, $2^{11}$ positions) that is not mapped to any character. UTF-16 uses this range as follows: it subtracts 0x10000 from the code point, which leaves a 20-bit value, and splits those 20 bits in half. The upper 10 bits are mapped into U+D800 to U+DBFF (a space of size $2^{10}$), called the high surrogate. The lower 10 bits are mapped into U+DC00 to U+DFFF (also of size $2^{10}$), called the low surrogate. Together they form a surrogate pair. In other words, a character in a supplementary plane is represented by two code units taken from the basic multilingual plane.

That explains why

'💩'.length == 2 // true

Character    Code point    UTF-16
💩           U+1F4A9       '\uD83D\uDCA9'
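
The mapping from a supplementary code point to its two code units can be sketched as follows. The function name toSurrogatePair is made up for this example, and the sketch assumes its input lies between 0x10000 and 0x10FFFF.

```javascript
// Hypothetical helper: split a supplementary code point into its UTF-16
// surrogate pair. Subtracting 0x10000 leaves a 20-bit value; the top 10 bits
// go into the high surrogate and the bottom 10 bits into the low surrogate.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);
  const low = 0xdc00 + (offset & 0x3ff);
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f4a9);
console.log(high.toString(16), low.toString(16));         // d83d dca9
console.log(String.fromCharCode(high, low) === '\u{1F4A9}'); // true
```

This also answers Question 2: 😀 (U+1F600) and 👍 (U+1F44D) share the high surrogate 0xD83D, and charAt(0) returns only that first code unit.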

For the exact conversion formula for surrogate pairs, see the following article:

  • mathiasbynens.be/notes/javas…

With this background, graphemes and grapheme clusters become easier to understand.

A grapheme is the smallest unit of written text and can be thought of as a single "character" as a reader sees it. In the Unicode standard, a character generally refers to a code point. Usually a grapheme is one character. However, some graphemes are made up of a sequence of several characters, and such a sequence is called a grapheme cluster. For example, the letter é can be formed by combining the letter e (U+0065) with a combining acute accent (U+0301). A character such as this accent, which modifies the character before it, is called a combining character.
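
A short sketch of the é example shows how one grapheme can be two code points. It uses only standard string methods; the variable names are arbitrary.

```javascript
const decomposed = 'e\u0301'; // e followed by a combining acute accent
const composed = '\u00e9';    // the precomposed character é

console.log(decomposed.length);       // 2
console.log(composed.length);         // 1
console.log(decomposed === composed); // false, although both render as é

// Normalizing to NFC composes the pair into the single code point U+00E9.
console.log(decomposed.normalize('NFC') === composed); // true
```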

Emoji in Unicode

But why is '🤦🏻‍♂️'.length === 7? This has to do with how emoji are combined.

Zero-width joiner (ZWJ)

The key is the zero-width joiner. It joins two characters that would otherwise stand apart. When a ZWJ is placed between two emoji, they can be rendered as a single new emoji.

For example, you can open the console and type in "\ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc66\u200d\ud83d\udc66".

Here \u200d is the UTF-16 escape for the ZWJ. It can glue multiple emoji into one.
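
To see the glue at work, a small sketch can split that family emoji on the ZWJ. The variable names are arbitrary; the string is the one from the console example.

```javascript
const ZWJ = '\u200d';
const family = '\ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc66\u200d\ud83d\udc66';

// Splitting on the joiner yields the individual emoji: man, woman, boy, boy.
const members = family.split(ZWJ);
console.log(members);
console.log(members.length); // 4

// Four surrogate pairs (8 code units) plus three joiners.
console.log(family.length);  // 11
```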

Emoji modifier

Besides ZWJ sequences, there are emoji modifiers (U+1F3FB to U+1F3FF), which change the skin tone of the emoji they follow.

Let's take a look at how a firefighter emoji is composed.
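
As an illustration, the sketch below lists the code points of a man firefighter with a light skin tone. The choice of this particular sequence and the helper name listCodePoints are assumptions made for the example. The same helper applied to 🤦🏻‍♂️ shows where its length of 7 comes from.

```javascript
// Hypothetical helper: list the code points of a string in U+XXXX notation.
// Array.from iterates by code point, so surrogate pairs stay together.
function listCodePoints(str) {
  return Array.from(str, ch =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

// man + light skin tone modifier + ZWJ + fire engine
const firefighter = '\u{1F468}\u{1F3FB}\u200D\u{1F692}';
console.log(listCodePoints(firefighter));
// [ 'U+1F468', 'U+1F3FB', 'U+200D', 'U+1F692' ]

// person facepalming + light skin tone + ZWJ + male sign + variation selector
const facepalm = '\u{1F926}\u{1F3FB}\u200D\u2642\uFE0F';
console.log(listCodePoints(facepalm));
// [ 'U+1F926', 'U+1F3FB', 'U+200D', 'U+2642', 'U+FE0F' ]

// Two surrogate pairs (4 code units) plus three BMP code units.
console.log(facepalm.length); // 7
```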

The difference between UTF-8 and UTF-16

For most websites, UTF-8 is a better choice than UTF-16 because it takes less space. Recall that UTF-8 encodes each ASCII character in a single byte, while UTF-16 needs two bytes for the same character (and four bytes for supplementary-plane characters). This means that an English text file encoded in UTF-16 is roughly twice the size of the same file encoded in UTF-8.
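
The size difference is easy to measure. In this sketch the helper names utf8ByteLength and utf16ByteLength are made up; TextEncoder is the standard API available in browsers and Node.js, and the UTF-16 size is derived from the fact that each code unit is 2 bytes.

```javascript
// Hypothetical helpers: byte size of a string in each encoding.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

function utf16ByteLength(str) {
  return str.length * 2; // str.length counts 16-bit code units
}

console.log(utf8ByteLength('hello'), utf16ByteLength('hello'));         // 5 10
console.log(utf8ByteLength('\u4E25'), utf16ByteLength('\u4E25'));       // 3 2
console.log(utf8ByteLength('\u{1F4A9}'), utf16ByteLength('\u{1F4A9}')); // 4 4
```

Note that the advantage holds for ASCII-heavy text; a BMP Chinese character actually takes 3 bytes in UTF-8 and 2 in UTF-16. Differences like these (bytes versus code units versus code points) are one plausible source of the front-end and back-end mismatch in Question 1.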

Unicode in JavaScript

ES2015

String length

For ordinary supplementary-plane characters, we can use String.prototype[@@iterator]. The string iterator is Unicode-aware, so [...str] and Array.from(str) (both of which use it) never split a surrogate pair: the result is an array in which each element is one complete code point. However, this approach still falls short for complex emoji. For a combined emoji such as the 🤦🏻‍♂️ above, the resulting length is still not the answer we want. String iterators alone therefore cannot meet our needs, and we have to rely on a third-party library.

  • Recommended dependency library: Punycode.js by Mathias Bynens
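
A minimal sketch of counting by code point with the string iterator, including the case where it still falls short. The helper name codePointLength is made up for this example.

```javascript
// Hypothetical helper: count code points instead of UTF-16 code units.
// Spreading a string uses the string iterator, which keeps surrogate
// pairs together.
function codePointLength(str) {
  return [...str].length;
}

console.log('\u{1F4A9}'.length);           // 2 code units
console.log(codePointLength('\u{1F4A9}')); // 1 code point

// A combined emoji is still several code points:
// facepalm + skin tone + ZWJ + male sign + variation selector.
const facepalm = '\u{1F926}\u{1F3FB}\u200D\u2642\uFE0F';
console.log(codePointLength(facepalm));    // 5, not 1
```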

Regular expression

/foo.bar/.test('foo💩bar') // false

// ES6
/foo.bar/u.test('foo💩bar') // true

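Without the u flag, . matches a single code unit, so it can only match the high surrogate "\ud83d" of the surrogate pair 💩 ("\ud83d\udca9"), and the overall match fails. ES6 adds the u flag, which makes . match a whole code point.

Additional uses of the u flag can be found in the following video: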

  • Mathias Bynens: RegExp.prototype.unicode | JSConf EU 2015

ES2018

Unicode property escapes

The Unicode standard assigns various properties and property values to each symbol. For example, to get the set of symbols used only in the Greek script, you can look up the symbols whose Script property is Greek in the Unicode database. Unfortunately, before ES2018 these Unicode character properties were not available in ECMAScript regular expressions, which made it difficult for developers to support full Unicode in regular expressions.

To address this issue, ES2018 introduced Unicode property escapes. Previously, developers who wanted equivalent regular expressions in JavaScript had to resort to run-time dependencies or build scripts, both of which led to performance and maintainability issues. With built-in support for Unicode property escapes, it is easy to create regular expressions based on Unicode properties.
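
As a quick sketch of the Greek case mentioned above, assuming a runtime that supports ES2018 property escapes (the variable names and sample sentence are arbitrary):

```javascript
// \p{...} requires the u flag; Script=Greek selects characters whose
// Script property is Greek.
const greekLetter = /\p{Script=Greek}/gu;

const sample = 'alpha is \u03b1, beta is \u03b2'; // α and β
console.log(sample.match(greekLetter));           // [ 'α', 'β' ]

console.log(/\p{Script=Greek}/u.test('abc'));     // false
```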

  1. Matching scripts

Unicode Script groups characters by the writing system they belong to, which generally corresponds to particular languages. For example, \p{Script=Greek} matches Greek letters and \p{Script=Han} matches Han (Chinese) characters.

The following matches the Chinese characters in a string:

let input = `I'm chinese!我是中国人`

console.log(input.match(/\p{Script=Han}+/u))

// ["我是中国人", index: 12, input: "I'm chinese!我是中国人", groups: undefined]
  2. Matching emoji

    const regex = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;

    We can use the regular expression above to match emoji:

    const regex = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;

    const text = `
    \u{231A}: ⌚ default emoji presentation character (Emoji_Presentation)
    \u{2194}\u{FE0F}: ↔️ default text presentation character rendered as emoji
    \u{1F469}: 👩 emoji modifier base (Emoji_Modifier_Base)
    \u{1F469}\u{1F3FF}: 👩🏿 emoji modifier base followed by a modifier
    `;

    // exec() with the g flag returns the next match on each call, or null.
    let match;
    while (match = regex.exec(text)) {
      const emoji = match[0];
      console.log(`Matched sequence ${ emoji } - code points: ${ [...emoji].length }`);
    }

    But it still fails to match a complex emoji such as 🤦🏻‍♂️ as a single unit.

ESNext

Intl.Segmenter: Unicode segmentation in JavaScript

Unicode already defines a grapheme segmentation algorithm that finds the boundaries between graphemes. Until now developers had to implement it themselves or pull in a library; with Intl.Segmenter it is finally becoming a native JavaScript API.

let segmenter = new Intl.Segmenter("zh", {granularity: "grapheme"});
let input = "How many characters? 🤦🏻‍♂️";
let segments = segmenter.segment(input);
for (let {segment, index} of segments) {
  console.log("segment at code units [%d, %d): «%s»", index, index + segment.length, segment);
}

Three granularities are currently available: grapheme, word, and sentence.

Finally! 🤦🏻‍♂️ is treated as a single segment instead of five separate code points. At the time of writing, this proposal was at Stage 3 of the TC39 process; it has since been standardized in ECMA-402 and ships in current browsers and Node.js.
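
Putting this together, a grapheme-aware length can be sketched as below. The helper name graphemeLength and the default locale 'en' are assumptions for the example, and the code assumes a runtime that provides Intl.Segmenter.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function graphemeLength(str, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  // The Segments object is iterable, yielding one entry per grapheme cluster.
  return [...segmenter.segment(str)].length;
}

const facepalm = '\u{1F926}\u{1F3FB}\u200D\u2642\uFE0F'; // 🤦🏻‍♂️

console.log(facepalm.length);             // 7 code units
console.log([...facepalm].length);        // 5 code points
console.log(graphemeLength(facepalm));    // 1 grapheme
console.log(graphemeLength('e\u0301'));   // 1 grapheme
```

This gives the three possible answers to Question 3 side by side; which one is "the" length depends on what the limit is meant to protect (storage, protocol, or what the user sees).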

References:

  • A Programmer’s Introduction to Unicode
  • What every JavaScript developer should know about Unicode
  • JavaScript’s internal character encoding: UCS-2 or UTF-16?
  • Mathias Bynens: RegExp.prototype.unicode | JSConf EU 2015
  • Unicode In JavaScript
  • Unicode property escapes | MDN
  • JavaScript has a Unicode problem
  • mathiasbynens.be/notes/es-un…
  • UNICODE EMOJI Standard
  • github.com/tc39/propos…
  • github.com/tc39/propos…
  • github.com/tc39/propos…
  • speakingjs.com/es5/ch24.ht…
  • Character encoding notes: ASCII, Unicode and UTF-8