This article is a translation.

What Every JavaScript Developer Should Know About Unicode

Originally by Dmitri Pavlutin

Original address: dmitripavlutin.com/what-every-…

The Chinese translations of most technical terms in this article follow the corresponding Chinese Wikipedia entries.

Before I begin, I have a confession to make: for a long time, I was terrified of Unicode. Whenever I encountered a programming task that required Unicode knowledge, I would hack together a solution without really understanding what I was doing.

I kept avoiding it until I ran into a problem that demanded a deep understanding of Unicode, and for which no ready-made solution existed.

So I sat down and read a lot of articles, and to my surprise, Unicode was not hard to understand. Though… I did read some of those articles at least three times.

Unicode, it turns out, is a universal and elegant standard, but it can be tough to learn because of the number of abstract terms it involves.

If you feel like your understanding of Unicode isn’t enough, now is the time to confront it head-on! It’s not that hard. Make yourself a delicious cup of tea or coffee ☕. Let’s delve into the wonderful world of abstraction, characters, astrals, and surrogates.

This article begins by explaining the basic concepts of Unicode to help you lay a solid foundation. How Unicode works in JavaScript and the pitfalls you may encounter in the process will be explained later. You’ll also learn how to apply the new ECMAScript 2015 features to solve some of these problems.

Are you ready? Let’s dive in!

1. The idea behind Unicode

Let’s start with an easy question. How can you read and understand this passage? It’s easy: you know each word, and the meaning of the words (which are groups of letters).

And why do you know the meaning of every word? Simply put, because you (the reader) and I (the writer) agree on an association between these graphical symbols (what you see on the screen) and their meanings.

The same thing happens with computers. The difference is that computers can’t understand the meaning of words. To a computer, these words are just sequences of bits.

Imagine a scenario where user 1 sends a hello message to user 2 over the network.

User 1’s computer does not know the meaning of each letter. So it converts hello to a sequence of numbers 0x68 0x65 0x6C 0x6C 0x6F, where each letter corresponds to a unique number: h corresponds to 0x68, e to 0x65, and so on. These numbers are sent to user 2’s computer.

After user 2’s computer receives the numeric sequence 0x68 0x65 0x6C 0x6C 0x6F, it uses the same correspondence between letters and numbers to restore the message, and then displays the correct text: hello.

The two computers have to agree on the correspondence between letters and numbers, and Unicode is that agreed-upon standard.

According to the Unicode standard, h is an abstract character named LATIN SMALL LETTER H. This character is assigned the number 0x68, which is its code point, written in the standard form U+0068.

Unicode provides an abstract Character Set and assigns each Character a unique identifier, a Coded Character Set.
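As a small preview (codePointAt() is an ES2015 method covered later in this article), here is how the letters of hello map to the numbers mentioned above:

// A minimal sketch of the letters-to-numbers agreement described above.
const message = 'hello';
const numbers = [...message].map((letter) => letter.codePointAt(0).toString(16));
console.log(numbers); // => ['68', '65', '6c', '6c', '6f']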

2. Basic Unicode terms

The website www.unicode.org says:

“Unicode provides a unique number for every character, no matter what platform, no matter what program, no matter what language.”

Unicode is a universal character set that defines the list of characters from most of the world’s writing systems and associates each character with a unique number (its code point).

Unicode covers the characters of most of today’s languages, as well as punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, emoji, and more.

The original Unicode version 1.0 was released in October 1991 and contained 7,161 characters. The most recent version, 14.0 (released in September 2021), contains 144,697 characters.

Before Unicode came along, vendors had created many character sets and encodings that were difficult to work with; Unicode solved most of the problems of that era in a universal and inclusive way.

Creating an application that supports all character sets and encodings is complex.

If you think Unicode is hard, imagine how much harder programming was without it.

I remember having to pick a character set and encoding more or less at random when reading a file. A pure lottery!

2.1 Characters and code points

An abstract character (or simply character) is a unit of information used to organize, control, or represent textual data.

Unicode deals with characters as abstract terms. Every abstract character has an associated name, such as LATIN SMALL LETTER A. The rendered form (glyph) of this character is a.

“A code point is a number assigned to a single character”

Code points are numbers from U+0000 to U+10FFFF.

U+<hex> is the format of a code point, where U+ is a prefix meaning Unicode and <hex> is a hexadecimal number. For example, U+0041 and U+2603 are code points.

Remember that a code point is just a simple number, and that’s how you should think of it. A code point is like an index into an array of characters.

The magic happens when Unicode associates code points with characters: the code point U+0041 corresponds to the character named LATIN CAPITAL LETTER A (rendered as A), and the code point U+2603 corresponds to the character named SNOWMAN (rendered as ☃).
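A quick sketch of these associations (String.fromCodePoint() is an ES2015 function discussed later in this article):

// Code points are plain numbers that Unicode maps to characters.
console.log(String.fromCodePoint(0x0041)); // => 'A' (LATIN CAPITAL LETTER A)
console.log(String.fromCodePoint(0x2603)); // => '☃' (SNOWMAN)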

Not all code points are associated with characters. A total of 1,114,112 code points are available (from U+0000 to U+10FFFF), but only 144,697 of them (as of September 2021) have characters assigned.

2.2 Unicode Planes

A plane is a range of 65,536 (or 10000 in hexadecimal) contiguous code points, from U+n0000 up to U+nFFFF, where n ranges from 0 to 10 in hexadecimal.

The plane divides all Unicode code points into 17 equal groups:

  • Plane 0 contains code points from U+0000 to U+FFFF
  • Plane 1 contains code points from U+10000 to U+1FFFF
  • ...
  • Plane 16 contains code points from U+100000 to U+10FFFF

Basic Multilingual Plane

Plane 0 is a special plane, called the Basic Multilingual Plane, or BMP for short. It contains the characters of most modern languages (Basic Latin, Cyrillic, Greek, etc.) and a large number of symbols (a Unicode symbol is a Unicode character that does not represent a written character of a language, but can still be used in text).

To sum up, the Basic Multilingual Plane covers the range U+0000 to U+FFFF: code points with up to four hexadecimal digits.

Developers typically only deal with characters in BMP. BMP contains most of the necessary characters.

Some characters in BMP:

  • e is U+0065, named LATIN SMALL LETTER E
  • | is U+007C, named VERTICAL BAR
  • ■ is U+25A0, named BLACK SQUARE
  • ☂ is U+2602, named UMBRELLA

Astral Planes

The astral plane is also the name, in classical, medieval, Eastern, and mystical philosophies and mystery religions, of a plane of existence: the world of the celestial spheres, a place souls are said to pass through after birth and death, believed to be inhabited by angels, spirits, or other immaterial beings. “Astral planes” is the informal name for the supplementary planes, because they were used so rarely (especially in the late 1990s) that they seemed as ethereal as the Great Beyond of the occult. Many people objected to this humorous name, and with planes 1 and 2 now in widespread use, fewer and fewer people feel these planes are really “astral”. Still, the joking label is a harmless reminder of how far away they are. See: www.opoudjis.net/unicode/uni…

The 16 planes after the BMP (plane 1, plane 2, …, plane 16) are called astral planes or supplementary planes.

Code points in the astral planes are called astral code points. They range from U+10000 to U+10FFFF.

An astral code point has 5 or 6 hexadecimal digits: U+ddddd or U+dddddd.

Let’s look at some characters from the astral planes:

  • 𝄞 is U+1D11E, named MUSICAL SYMBOL G CLEF
  • 𝐁 is U+1D401, named MATHEMATICAL BOLD CAPITAL B
  • 🀵 is U+1F035, named DOMINO TILE HORIZONTAL-00-04
  • 😀 is U+1F600, named GRINNING FACE

2.3 Code Units

OK, the Unicode “characters,” “code points,” and “planes” we just talked about are abstractions.

Now it’s time to talk about how Unicode is implemented at the physical, hardware level.

At the memory level, computers do not deal with code points or abstract characters; they need a physical way to represent Unicode code points: code units.

“A code unit is a bit sequence used to encode each character within a given encoding form.”

Character encoding turns abstract code points into physical bits: code units. In other words, character encoding translates Unicode code points into unique sequences of code units.

Common character encodings are UTF-8, UTF-16, and UTF-32.

Most JavaScript engines use UTF-16 encoding, so let’s focus on UTF-16 now.

UTF-16 (16-bit Unicode Transformation Format) is a variable-length encoding:

  • BMP code points are encoded as a single 16-bit code unit
  • Astral code points are encoded as two 16-bit code units (a surrogate pair)

OK, enough dry theory. Let’s look at some examples.

Suppose you want to save the character a (LATIN SMALL LETTER A) to your hard drive. Unicode tells you that the abstract character LATIN SMALL LETTER A maps to the code point U+0061.

Now let’s see how UTF-16 encodes U+0061. The encoding specification says that for a BMP code point, the hexadecimal number 0x0061 is simply stored as a single 16-bit code unit: 0x0061.

As you can see, BMP code points fit neatly into a single 16-bit code unit.
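Here is a small sketch of that (charCodeAt() returns the code unit at a given index):

// 'a' is a BMP character: one symbol, one 16-bit code unit.
const letter = 'a';
console.log(letter.length);                     // => 1
console.log(letter.charCodeAt(0).toString(16)); // => '61'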

2.4 Surrogate Pairs

Let’s take a more complicated example. Suppose you want to encode the GRINNING FACE character 😀. This character maps to the code point U+1F600 in an astral plane.

Since astral code points require up to 21 bits of information, UTF-16 uses two 16-bit code units. The code point U+1F600 is split into a so-called surrogate pair: 0xD83D (the high surrogate) and 0xDE00 (the low surrogate).

“A surrogate pair is a representation of a single abstract character that consists of a sequence of two 16-bit code units, where the first value is a high surrogate and the second value is a low surrogate.”

The two code units that encode an astral code point are called a surrogate pair. For example, encoding U+1F600 (😀) in UTF-16 yields the surrogate pair 0xD83D 0xDE00.

console.log('\uD83D\uDE00'); // => '😀'

High surrogate code units range from 0xD800 to 0xDBFF; low surrogate code units range from 0xDC00 to 0xDFFF.

The algorithm for converting between surrogate pairs and astral code points is as follows:

function getSurrogatePair(astralCodePoint) {
  let highSurrogate = 
     Math.floor((astralCodePoint - 0x10000) / 0x400) + 0xD800;
  let lowSurrogate = (astralCodePoint - 0x10000) % 0x400 + 0xDC00;
  return [highSurrogate, lowSurrogate];
}
getSurrogatePair(0x1F600); // => [0xD83D, 0xDE00]
function getAstralCodePoint(highSurrogate, lowSurrogate) {
  return (highSurrogate - 0xD800) * 0x400 
      + lowSurrogate - 0xDC00 + 0x10000;
}
getAstralCodePoint(0xD83D, 0xDE00); // => 0x1F600

Surrogate pairs are not comfortable to work with. When handling strings in JavaScript, you have to treat them as special cases, which I describe later in this article.

However, UTF-16 is memory efficient. About 99% of the characters we deal with come from the BMP and need only a single code unit, which saves a lot of memory.

2.5 Combining Marks

“A Grapheme, or Symbol, is the smallest meaningful unit of writing.”

A grapheme is how the user thinks of a character. The specific graphic form of a grapheme rendered on the screen is called a glyph.

Translator’s Note: “grapheme” here refers to a recognizable abstract graphic symbol that does not depend on any particular design; it is also rendered as “character unit” or “written form”, and refers to the shape of the written unit rather than to any specific letterform.

In most cases, a single Unicode character corresponds to a single grapheme. For example, U+0066 LATIN SMALL LETTER F is rendered as the grapheme f.

But there are also cases where a grapheme is formed by a sequence of characters.

For example, å is an atomic grapheme in the Danish writing system. It is displayed by combining U+0061 LATIN SMALL LETTER A (rendered as a) with the special character U+030A COMBINING RING ABOVE (rendered as the ring ◌̊).

U+030A modifies the character that precedes it and is called a combining mark (or combining character).

console.log('\u0061\u030A'); // => 'å'
console.log('\u0061');       // => 'a'

“A combining character is a character that applies to the preceding base character to create a grapheme.”

Combining marks include accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.

Combining marks are not normally used in isolation (that is, without a base character), and you should avoid rendering them on their own.

Just like surrogate pairs, combining character sequences are awkward to handle in JavaScript.

A combining character sequence (base character + combining mark) is perceived by the user as a single symbol (e.g. '\u0061\u030A' is 'å'), but the developer must use two code points, U+0061 and U+030A, to construct å.
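A minimal sketch of that mismatch between what the user sees and what the developer writes:

// One perceived symbol, built from two code points (base character + combining mark).
const ring = '\u0061\u030A';
console.log(ring);              // => 'å'
console.log(ring.length);       // => 2 (two code units, although one symbol is displayed)
console.log(ring === '\u00E5'); // => false (the precomposed 'å' is a different code point)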

3. Unicode in JavaScript

The ES2015 specification says that program source text is expressed in Unicode (version 5.1 or later), with code points from U+0000 to U+10FFFF. The ECMAScript specification does not prescribe an encoding for storing or exchanging source code, but UTF-8 (the preferred encoding on the web) is usually used.

I recommend keeping the source code within the Basic Latin Unicode block (essentially ASCII) and escaping any characters outside that range. This reduces the chance of encoding problems.
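For example (a small sketch; the escape syntax itself is covered in section 3.1):

// Prefer escaped code points over literal non-ASCII characters in source files.
const escaped = 'Fran\u00E7ois';  // safe regardless of how the file's encoding is guessed
const literal = 'François';       // relies on the source file being read as UTF-8
console.log(escaped === literal); // => true (both use the precomposed ç, U+00E7)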

Digging deeper, at the language level, ECMAScript 2015 provides a clear definition of String in JavaScript:

“The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”) up to a maximum length of 2^53 − 1 elements. The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.”

The engine treats every element of a string as a code unit. The way a string renders does not reliably tell you which code units it contains. Look at this example:

console.log('cafe\u0301'); // => 'café'
console.log('café');       // => 'café'

'cafe\u0301' and 'café' are slightly different literals, yet both render as the same sequence of symbols: café.

In the earlier sections on surrogate pairs and combining marks, we saw that some symbols need two or more code units to be represented. So be careful when counting characters or accessing them by index:

const smile = '\uD83D\uDE00';
console.log(smile);        // => '😀'
console.log(smile.length); // => 2
const letter = 'e\u0301';
console.log(letter);        // => 'é'
console.log(letter.length); // => 2

The smile string contains two code units: \uD83D (the high surrogate) and \uDE00 (the low surrogate). Since a string is a sequence of code units, smile.length evaluates to 2, even though smile renders as a single symbol 😀.

I recommend always thinking of JavaScript strings as sequences of code units. The way a string renders does not make it clear which code units it contains.
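Here is a small sketch of the difference between the code units a string stores and the symbols it renders (for...of uses the Unicode-aware string iterator described in section 3.3):

const face = 'hi\u{1F600}'; // 'hi😀'
// Code units: what .length and charCodeAt() see.
for (let i = 0; i < face.length; i++) {
  console.log(face.charCodeAt(i).toString(16)); // => '68', '69', 'd83d', 'de00'
}
// Code points: what the Unicode-aware iterator yields.
for (const symbol of face) {
  console.log(symbol); // => 'h', 'i', '😀'
}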

Astral symbols and combining character sequences need 2 or more code units to be encoded, yet each of them is perceived by the user as a single grapheme.

If a string contains surrogate pairs or combining character sequences, a developer who is unaware of this can be confused when calculating the string’s length or accessing characters by index.

Most JavaScript string methods are not Unicode-aware. If your string contains complex Unicode characters, take precautions when calling methods like myString.slice() and myString.substring().
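For instance, here is a minimal sketch of slice() cutting straight through a surrogate pair:

const smile = '\uD83D\uDE00'; // '😀'
// slice() counts code units, so it can split a surrogate pair in half.
console.log(smile.slice(0, 1)); // => '\uD83D' (a lone high surrogate, not a valid symbol)
console.log(smile.slice(1));    // => '\uDE00' (a lone low surrogate)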

3.1 Escape Sequences

Escape sequences in JavaScript strings are used to express code units based on code point numbers. JavaScript has three escape types, one of which was introduced in ECMAScript 2015.

Let’s see more details.

Hexadecimal escape sequence

The shortest escape form is called a hexadecimal escape sequence: \x<hex>, where \x is the prefix and <hex> is a fixed-length two-digit hexadecimal number. For example, '\x30' (the symbol '0') or '\x5B' (the symbol '[').

Here is what hexadecimal escape sequences look like in a string literal and in a regular expression:

const str = '\x4A\x61vaScript';
console.log(str);                    // => 'JavaScript'
const reg = /\x4A\x61va.*/;
console.log(reg.test('JavaScript')); // => true

Because only two hexadecimal digits are available, a hexadecimal escape sequence can represent only a limited range of code points, from U+00 to U+FF. The advantage is that it is short.

Unicode escape sequences

If you want to escape code points from the entire BMP, use a Unicode escape sequence. The escape format is \u<hhhh>, where \u is the prefix and <hhhh> is a fixed-length four-digit hexadecimal number. For example, '\u0051' (the symbol 'Q') or '\u222B' (the integral symbol '∫').

Let’s look at Unicode escape sequences:

const str = 'I\u0020learn \u0055nicode';
console.log(str);                 // => 'I learn Unicode'
const reg = /\u0055ni.*/;
console.log(reg.test('Unicode')); // => true

Because only four hexadecimal digits are available, a Unicode escape sequence can represent code points in the range U+0000 to U+FFFF (the entire BMP), which is enough for the commonly used symbols in most cases.

To represent an astral symbol in a JavaScript literal, you have to chain two Unicode escape sequences (a high surrogate and a low surrogate), which creates a surrogate pair:

const str = 'My face \uD83D\uDE00';
console.log(str); // => 'My face 😀'

Code point escape sequence

ECMAScript 2015 introduced escape sequences that can represent code points from the entire Unicode space, U+0000 to U+10FFFF: both the BMP and the astral planes.

The new format is called a code point escape sequence: \u{<hex>}, where <hex> is a hexadecimal number from 1 to 6 digits long. For example, '\u{7A}' (the symbol 'z') or '\u{1F639}' (the funny cat symbol '😹').

See how to use it in literals:

const str = 'Funny cat \u{1F639}';
console.log(str);                      // => 'Funny cat 😹'
const reg = /\u{1F639}/u;
console.log(reg.test('Funny cat 😹')); // => true

Note that the regular expression /\u{1F639}/u has a special flag u, which is used to turn on additional Unicode features (see 3.5 Regular Expression Matching).

When representing astral symbols, I prefer code point escapes over surrogate pairs. Let’s escape U+1F607 SMILING FACE WITH HALO:

const niceEmoticon = '\u{1F607}';
console.log(niceEmoticon);   // => '😇'
const spNiceEmoticon = '\uD83D\uDE07';
console.log(spNiceEmoticon); // => '😇'
console.log(niceEmoticon === spNiceEmoticon); // => true

The string literal assigned to niceEmoticon uses the code point escape \u{1F607} to represent the astral code point U+1F607. Under the hood, however, the code point escape still creates a surrogate pair (2 code units). As you can see, spNiceEmoticon, built from the Unicode escapes '\uD83D\uDE07' (a surrogate pair), is equal to niceEmoticon.

If a regular expression is created with the RegExp constructor, in the string literal you must replace each \ with \\ to indicate the escape. The following regular expression objects are equivalent:

const reg1 = /\x4A \u0020 \u{1F639}/;
const reg2 = new RegExp('\\x4A \\u0020 \\u{1F639}');
console.log(reg1.source === reg2.source); // => true

3.2 String Comparison

Strings in JavaScript are sequences of code units, so you can reasonably expect that comparing strings means checking whether the two sequences of code units are the same.

It’s fast and efficient. This works well with “simple” strings:

const firstStr = 'hello';
const secondStr = '\u0068ell\u006F';
console.log(firstStr === secondStr); // => true

firstStr and secondStr contain the same sequence of code units, so the comparison is true.

Things get trickier when you compare two strings that render the same but contain different code unit sequences. You may get an unexpected result: the strings look the same, yet they are not equal:

const str1 = 'ça va bien';
const str2 = 'c\u0327a va bien';
console.log(str1);          // => 'ça va bien'
console.log(str2);          // => 'ça va bien'
console.log(str1 === str2); // => false

str1 and str2 render the same, but their code units are different. This happens because the grapheme ç can be constructed in two different ways:

  • One uses U+00E7 LATIN SMALL LETTER C WITH CEDILLA
  • The other uses a combining character sequence: U+0063 LATIN SMALL LETTER C followed by the combining mark U+0327 COMBINING CEDILLA

How do you handle this and compare strings correctly? The answer is Normalization of strings.

Normalization

“Normalization is the conversion of a string to a canonical representation, to ensure that canonically equivalent (and/or compatibility-equivalent) strings have a unique representation.”

In other words, when a string has a complex structure (for example, combining character sequences), it can be normalized to a canonical form. Normalized strings are painless to compare and to use in string operations such as text search.

Unicode Standard Annex #15 (Unicode Normalization Forms) provides interesting details about the normalization process.

In JavaScript you normalize strings by calling myString.normalize([normForm]), available since ES2015. normForm is an optional parameter (defaulting to 'NFC') that can take one of the following normalization forms:

  • 'NFC' for Normalization Form Canonical Composition
  • 'NFD' for Normalization Form Canonical Decomposition
  • 'NFKC' for Normalization Form Compatibility Composition
  • 'NFKD' for Normalization Form Compatibility Decomposition

Let’s improve the previous example by applying string normalization, which lets us compare the strings correctly:

const str1 = 'ça va bien';
const str2 = 'c\u0327a va bien';
console.log(str1 === str2.normalize()); // => true
console.log(str1 === str2);             // => false

'ç' and 'c\u0327' are canonically equivalent. When str2.normalize() is called, a canonical version of str2 is returned ('c\u0327' replaced with 'ç'), so the comparison str1 === str2.normalize() returns true, as expected. str1 is not affected by normalization because it is already in canonical form.

It makes sense to normalize both operands, so that both sides of the comparison are in canonical form.
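A minimal sketch of that practice:

// Normalize both operands before comparing, so the result does not depend
// on how either string happened to be composed.
const s1 = 'c\u0327a va bien'; // combining cedilla
const s2 = '\u00E7a va bien';  // precomposed ç
console.log(s1.normalize() === s2.normalize()); // => true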

3.3 String Length

The most common way to determine the length of a string is, of course, to access the myString.length property. This property indicates the number of code units the string contains.

For strings that contain only BMP characters, this indeed gives the expected result:

const color = 'Green';
console.log(color.length); // => 5

Each symbol in color is encoded with a single code unit, so the length is 5, as expected.

Length and surrogate pairs

Things get tricky when a string contains surrogate pairs (which encode astral code points). Because each surrogate pair contains two code units (a high surrogate and a low surrogate), the length property is bigger than expected.

Look at this example:

const str = 'cat\u{1F639}';
console.log(str);        // => 'cat😹'
console.log(str.length); // => 5

When rendered, the string str contains four symbols: cat😹. However, str.length evaluates to 5, because U+1F639 is an astral code point encoded with two code units (a surrogate pair).

Unfortunately, there is no native and efficient way to solve this problem.

At least ECMAScript 2015 introduced algorithms that are aware of astral symbols: an astral symbol counts as a single character, even though it is encoded with two code units.

String.prototype[@@iterator]() is Unicode-aware. You can use the spread operator [...str] or Array.from(str) (both of which use the string iterator under the hood), then count the number of symbols in the resulting array.

Note that this solution can cause performance problems if widely used.

Let’s improve the previous example by using the spread operator:

const str = 'cat\u{1F639}';
console.log(str);             // => 'cat😹'
console.log([...str]);        // => ['c', 'a', 't', '😹']
console.log([...str].length); // => 4

[...str] creates an array of four symbols. The surrogate pair encoding U+1F639 CAT FACE WITH TEARS OF JOY 😹 stays intact because the string iterator is Unicode-aware.
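If you need this in several places, it is easy to wrap in a small helper (a sketch; countSymbols is a hypothetical name, not part of any standard API):

// Counts symbols (code points) rather than code units, keeping surrogate pairs intact.
// Note: combining character sequences are still counted per code point.
function countSymbols(string) {
  return [...string].length;
}

console.log(countSymbols('cat\u{1F639}')); // => 4
console.log('cat\u{1F639}'.length);        // => 5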

Length and combining marks

What about combining character sequences? Because each combining mark is a separate code unit, the same problem arises.

String normalization can help here: if you are lucky, the combining character sequence has a canonical single-character form. Let’s try:

const drink = 'cafe\u0301';
console.log(drink);                    // => 'café'
console.log(drink.length);             // => 5
console.log(drink.normalize());        // => 'café'
console.log(drink.normalize().length); // => 4

The drink string contains 5 code units (so drink.length is 5), even though it renders as 4 symbols.

When drink is normalized, the combining character sequence 'e\u0301' luckily has the canonical form 'é', so drink.normalize().length gives the expected 4.

Unfortunately, normalization is not a universal solution: longer combining character sequences do not always have a single-character canonical equivalent. Let’s look at such a case:

const drink = 'cafe\u0327\u0301';
console.log(drink);                    // => 'cafȩ́'
console.log(drink.length);             // => 6
console.log(drink.normalize());        // => 'cafȩ́'
console.log(drink.normalize().length); // => 5

drink has 6 code units, so drink.length is 6. However, drink renders as 4 symbols.

Normalizing with drink.normalize() converts the combining sequence 'e\u0327\u0301' into the canonical two-character form 'ȩ\u0301' (only one of the combining marks is absorbed). Sadly, drink.normalize().length is 5, which still does not reflect the number of visible symbols.

3.4 Character Location

Since a string is a sequence of code units, accessing a character by index also presents some challenges.

If a string contains only BMP characters (excluding the high surrogate range U+D800 to U+DBFF and the low surrogate range U+DC00 to U+DFFF), character positioning works as expected:

const str = 'hello';
console.log(str[0]); // => 'h'
console.log(str[4]); // => 'o'

Each symbol is encoded with a single code unit, so accessing it by index works fine.

Character positioning and surrogate pairs

The situation changes once the string contains astral symbols.

Astral symbols are encoded with 2 code units (a surrogate pair). When accessing characters by index, you can end up with a lone high surrogate or low surrogate, neither of which is a valid symbol.

const omega = '\u{1D6C0} is omega';
console.log(omega);        // => '𝛀 is omega'
console.log(omega[0]);     // => '' (an unprintable lone high surrogate)
console.log(omega[1]);     // => '' (an unprintable lone low surrogate)

Because U+1D6C0 MATHEMATICAL BOLD CAPITAL OMEGA is an astral character, it is encoded as a surrogate pair of two code units. omega[0] accesses the high surrogate and omega[1] the low surrogate, breaking the surrogate pair apart.

There are two reasonable ways to access an astral symbol inside a string:

  • Generate an array of symbols with the Unicode-aware string iterator and index into it: [...str][index]
  • Get the code point number with number = myString.codePointAt(index), then convert it back into a symbol with String.fromCodePoint(number) (recommended)

Let’s try these two methods:

const omega = '\u{1D6C0} is omega';
console.log(omega);                        // => '𝛀 is omega'
// Option 1
console.log([...omega][0]);                // => '𝛀'
// Option 2
const number = omega.codePointAt(0);
console.log(number.toString(16));          // => '1d6c0'
console.log(String.fromCodePoint(number)); // => '𝛀'

[...omega] returns the symbols that omega contains. Surrogate pairs are evaluated correctly, so accessing the first symbol works as expected: [...omega][0] is '𝛀'.

The omega.codePointAt(0) method is Unicode-aware, so it returns the astral code point number 0x1D6C0 of the first character in omega. The function String.fromCodePoint(number) then returns the symbol '𝛀' based on that code point number.

Character positioning and combining marks

The same problem applies when positioning characters in strings that contain combining marks.

Accessing a character by index returns a single code unit, but a combining character sequence should be accessed as a whole, not split into its constituent code units.

The following example illustrates the problem:

const drink = 'cafe\u0301';  
console.log(drink);        // => 'café'
console.log(drink.length); // => 5
console.log(drink[3]);     // => 'e'
console.log(drink[4]);     // => '◌́' (the combining mark on its own)

drink[3] accesses only the base character e, without the combining mark U+0301 COMBINING ACUTE ACCENT. drink[4] accesses the isolated combining mark ◌́.

In this case, normalize the string: the combination of U+0065 LATIN SMALL LETTER E and U+0301 COMBINING ACUTE ACCENT has the canonical equivalent U+00E9 LATIN SMALL LETTER E WITH ACUTE é. Let’s improve the previous code example:

const drink = 'cafe\u0301';
console.log(drink.normalize());        // => 'café'
console.log(drink.normalize().length); // => 4
console.log(drink.normalize()[3]);     // => 'é'

Note that not every combining character sequence has a single-character canonical equivalent, so normalization is not a universal solution.

Fortunately, this approach works for most European and American languages.

3.5 Regular Expression Matching

Regular expressions, like strings, work at the level of code units. Just as in the scenarios described earlier, surrogate pairs and combining character sequences can cause problems in regular expressions.

BMP characters match as expected, because a single code unit represents a single symbol:

const greetings = 'Hi!';
const regex = /.{3}/;
console.log(regex.test(greetings)); // => true

The three symbols in greetings are encoded with three code units. The regular expression /.{3}/ expects three code units, so it matches greetings.

When matching astral symbols (each encoded as a surrogate pair of 2 code units), you can run into problems:

const smile = '😀';
const regex = /^.$/;
console.log(regex.test(smile)); // => false

smile contains the astral symbol U+1F600 GRINNING FACE, which is encoded as the surrogate pair 0xD83D 0xDE00. The regular expression /^.$/ expects a single character, i.e. a single code unit, but the symbol occupies two, so the match fails: regex.test(smile) is false.

The situation is even worse when defining character classes with astral symbols. JavaScript throws an error:

const regex = /[😀-😎]/;
// => SyntaxError: Invalid regular expression: /[😀-😎]/: 
// Range out of order in character class

Astral symbols are encoded as surrogate pairs, so JavaScript effectively sees the pattern in terms of code units: /[\uD83D\uDE00-\uD83D\uDE0E]/. Each code unit is treated as a separate element of the pattern, and the regular expression ignores the concept of a surrogate pair. Because \uDE00 is greater than \uD83D, the range \uDE00-\uD83D inside the character class is invalid, hence the error.

The regular expression flag u

Fortunately, ECMAScript 2015 introduced the very useful u flag, which makes regular expressions Unicode-aware. When this flag is enabled, astral symbols are handled correctly.

With the flag you can use code point escape sequences in regular expressions, e.g. /\u{1F600}/u. This escape is shorter than writing the high surrogate + low surrogate pair /\uD83D\uDE00/.

Let’s apply the u flag and see how the . operator (which matches any character) handles an astral symbol:

const smile = '😀';
const regex = /^.$/u;
console.log(regex.test(smile)); // => true

The regular expression /^.$/u, made Unicode-aware by the u flag, now matches 😀.

With the u flag enabled, character classes also handle astral symbols correctly:

const smile = '😀';
const regex = /[😀-😎]/u;
const regexEscape = /[\u{1F600}-\u{1F60E}]/u;
const regexSpEscape = /[\uD83D\uDE00-\uD83D\uDE0E]/u;
console.log(regex.test(smile));         // => true
console.log(regexEscape.test(smile));   // => true
console.log(regexSpEscape.test(smile)); // => true

[😀-😎] is now treated as a range of astral symbols, so /[😀-😎]/u matches 😀.

Regular expressions and combining marks

Unfortunately, with or without the u flag, regular expressions treat combining marks as separate code units.

If you need to match a combining character sequence, you have to match the base character and the combining mark separately.

Look at the following example:

const drink = 'cafe\u0301';
const regex1 = /^.{4}$/;
const regex2 = /^.{5}$/;
console.log(drink);              // => 'café'
console.log(regex1.test(drink)); // => false
console.log(regex2.test(drink)); // => true

The rendered string contains four symbols: café.

However, the regular expressions count code units: /^.{4}$/ fails to match, while /^.{5}$/ matches 'cafe\u0301', which consists of five code units.
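One practical workaround (a sketch, not part of the original example) is to normalize the string before matching, when a canonical single-character form exists:

// 'cafe\u0301' normalizes to 'café' (4 code units), so a 4-character pattern matches.
const normalizedDrink = 'cafe\u0301'.normalize();
console.log(/^.{4}$/.test(normalizedDrink)); // => true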

4. To summarize

Probably the most important takeaway about Unicode in JavaScript is to treat strings as sequences of code units, because that is what they really are.

Confusion arises when a developer assumes that strings are composed of graphemes (symbols) and ignores the underlying sequence of code units.

Strings that contain surrogate pairs or combining character sequences make it easy to stumble when:

  • Getting the string length
  • Positioning characters by index
  • Matching with regular expressions

Note that most string methods in JavaScript are not Unicode-aware: myString.indexOf(), myString.slice(), and so on.

ECMAScript 2015 introduces great features such as code point escape sequences \u{1F600} in strings and regular expressions.

The new regular expression flag u enables Unicode-aware matching, which makes it much easier to match astral symbols.

String.prototype[@@iterator]() is Unicode-aware. You can use the spread operator [...str] or Array.from(str) to create an array of symbols, then calculate the string length or access characters by index without breaking surrogate pairs. Keep in mind that these operations come with a performance cost.

If you need more robust Unicode handling, you can use the punycode utility library or generate specialized regular expressions with regenerate.

I hope this article has helped you master Unicode!

Do you know anything else interesting about Unicode in JavaScript? Feel free to comment below!