Read Unicode

1. What is Unicode?

Unicode emerged to unify the characters of all languages into a single character set that remains compatible with ASCII. For the reasons behind this, see the previous article.

Unicode assigns a unique number to each character. In Unicode, this number is called a code point.

The relationship between Unicode and UTF-X: Unicode is a character set, while UTF-32, UTF-16, and UTF-8 are three character encoding schemes for it.

2. The Unicode code point

A code point plays the same role as the ASCII value in ASCII: it is the unique identifier of a character in the Unicode character set.

Alongside the code point there is a related concept called the code unit.

A code unit is the unit in which a code point is stored on a computer. For UTF-8, the code unit is 8 bits long; for UTF-16, it is 16 bits long. More on code units later.

A code point is written as U+[XX]XXXX.

Each X is a hexadecimal digit, and there can be 4 to 6 of them. A value with fewer than 4 digits is padded with leading zeros to 4 digits; a value with more than 4 digits is written with as many digits as it needs. The range is U+0000 to U+10FFFF, so the theoretical number of code points is 0x10FFFF + 1 = 0x110000, which is 17 × 2^16 = 17 × 65536 = 1,114,112, a little over one million.
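The notation rule above can be sketched in a few lines of JavaScript. The helper name formatCodePoint is made up for this illustration; it is not a built-in.

```javascript
// Format a number in the U+XXXX notation: upper-case hexadecimal,
// zero-padded to at least 4 digits (hypothetical helper for illustration).
function formatCodePoint(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('not a Unicode code point: ' + codePoint);
  }
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// Size of the code space: U+0000 through U+10FFFF inclusive.
const totalCodePoints = 0x10FFFF + 1;

console.log(formatCodePoint(0x41));          // U+0041
console.log(formatCodePoint(0x1F600));       // U+1F600
console.log(totalCodePoints === 17 * 65536); // true
```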

According to the Unicode Standard, this is the final code point range and it will not be extended.

To better classify and manage such a large number of code points, every 65536 code points are grouped into a plane, giving 17 planes in total.
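Because each plane holds exactly 65536 (2^16) code points, the plane number is simply the code point divided by 65536. A small sketch, with planeOf as a hypothetical helper name:

```javascript
// Plane number of a code point: each plane spans 0x10000 code points,
// so shifting right by 16 bits gives the plane index (0 to 16).
function planeOf(codePoint) {
  return codePoint >> 16;
}

console.log(planeOf(0x0041));   // 0  (first plane)
console.log(planeOf(0x1F600));  // 1
console.log(planeOf(0x10FFFF)); // 16 (last plane)
```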

3. Plane, BMP, and SP

Plane: a contiguous set (space) of 65536 code points.

For details on which planes contain which characters, see this website.

3.1 BMP

The first of the 17 planes is the Basic Multilingual Plane (BMP), also called Plane 0. Its code point range is U+0000 to U+FFFF. This is the plane we use most, and most of the characters in everyday use fall within it. UTF-16 needs only two bytes to encode a character in this plane. Characters that are not included in the BMP are placed in the later planes.

Let’s look at some specific code point ranges in the BMP:

CJK Unified Ideographs (Chinese, Japanese, and Korean)

Range: 4E00-9FFF; total number of characters: 20992. Among them, the range of Chinese characters is 4E00-9FD5.

This is why the regular expression [\u4e00-\u9fd5] is commonly used to match Chinese text. For less strict applications this range check is basically fine, but it is not completely correct, because it misses the Han characters encoded outside this block. A more accurate alternative is the Unicode property escape /\p{sc=Han}/gu.
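To make the difference concrete, the sketch below runs both patterns over three sample characters. The sample list is an assumption chosen for illustration; U+20000 is a Han character that lies outside the BMP.

```javascript
// Range-based check: only covers part of one BMP block.
const bmpRange = /[\u4e00-\u9fd5]/;
// Script-based check: matches any character whose Script property is Han.
// (sc=Han is the short form of Script=Han; the u flag is required.)
const hanScript = /\p{Script=Han}/u;

const samples = ['中', 'A', '\u{20000}'];
for (const ch of samples) {
  const label = 'U+' + ch.codePointAt(0).toString(16).toUpperCase();
  console.log(label, bmpRange.test(ch), hanScript.test(ch));
}
// U+4E2D  is matched by both patterns,
// U+41    by neither,
// U+20000 only by the script-based pattern.
```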

Surrogate Area

Range: D800-DFFF. D800-DBFF is the High Surrogate Area and DC00-DFFF is the Low Surrogate Area; each contains 4 × 256 = 1024 code points. Together the two areas form a two-dimensional table with 1024 × 1024 = 2^10 × 2^10 = 2^4 × 2^16 = 16 × 65536 entries, which is exactly enough to represent all the code points in the 16 additional planes.

Surrogate pairs

A pair consisting of a high (lead) surrogate followed by a low (trail) surrogate is called a surrogate pair. The order must be high first, then low. A pair made of two highs, two lows, or a low followed by a high is illegal.
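The pairing rule can be checked mechanically. The following is a minimal sketch; the function names are made up for this example, and it only tests the high-then-low ordering described above.

```javascript
const isHighSurrogate = (unit) => unit >= 0xD800 && unit <= 0xDBFF;
const isLowSurrogate = (unit) => unit >= 0xDC00 && unit <= 0xDFFF;

// Returns true when every surrogate in the string belongs to a
// correctly ordered high + low pair (hypothetical helper).
function hasOnlyPairedSurrogates(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (isHighSurrogate(unit)) {
      // A high surrogate must be followed immediately by a low one.
      if (i + 1 >= str.length || !isLowSurrogate(str.charCodeAt(i + 1))) {
        return false;
      }
      i++; // skip the low surrogate that completes the pair
    } else if (isLowSurrogate(unit)) {
      return false; // a low surrogate with no high surrogate before it
    }
  }
  return true;
}

console.log(hasOnlyPairedSurrogates('\uD83D\uDE00')); // true  (high, low)
console.log(hasOnlyPairedSurrogates('\uD83D\uD83D')); // false (high, high)
console.log(hasOnlyPairedSurrogates('\uDE00\uD83D')); // false (low, high)
```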

3.2 SP

The next 16 planes are called Supplementary Planes (SP). The code points in these planes exceed the upper limit of the 16-bit space, so UTF-16 uses a four-byte encoding for characters in these planes.

Note: many of these planes are still empty. No characters have been assigned to them yet; they have only been reserved.

About 24 million Chinese characters are included in cihai.

Also, some planes are reserved for private use, such as the last two planes in the figure above (the Private Use Planes), where you can define your own characters.

4. Unicode encoding

A code point is only an abstract concept: it is the result of assigning a number to a character, nothing more.

The whole encoding process can be divided into two steps. First, each character is mapped to a specific number according to the character set; then that number is stored in the computer in a specific byte format.

4.1 Two layers of Unicode encoding

1. Abstract encoding layer

This layer maps each character to a number. It says nothing about how many bytes represent each number, or whether a fixed or variable length is used.

2. Concrete encoding layer

The conversion from a code point to its final encoded form is defined by a UTF (Unicode Transformation Format).

This layer encodes the numbers (code points) from the abstract encoding layer into their final storage form (UTF-X). Converting a code point into a concrete encoding can use a fixed-length or a variable-length scheme, which raises questions such as how many bytes a fixed-length scheme uses, which byte lengths a variable-length scheme allows, and how the different lengths are told apart.

UTF-32 is a fixed-length encoding that always stores a code point in 4 bytes, while UTF-8 and UTF-16 are variable-length encodings. UTF-8 uses 1 to 4 bytes depending on the code point, while UTF-16 uses 2 or 4 bytes.
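The byte counts can be compared directly. This sketch assumes a runtime that provides the global TextEncoder (modern browsers and recent Node.js versions); byteLengths is a hypothetical helper, and the UTF-16 and UTF-32 sizes are computed rather than produced by an encoder.

```javascript
// Bytes needed to store a string in each encoding form (without a BOM).
function byteLengths(str) {
  const codePointCount = Array.from(str).length; // Array.from iterates by code point
  return {
    utf8: new TextEncoder().encode(str).length,  // TextEncoder always emits UTF-8
    utf16: str.length * 2,                       // JS strings are 16-bit code units
    utf32: codePointCount * 4,                   // fixed 4 bytes per code point
  };
}

console.log(byteLengths('A'));  // { utf8: 1, utf16: 2, utf32: 4 }
console.log(byteLengths('中')); // { utf8: 3, utf16: 2, utf32: 4 }
console.log(byteLengths('😀')); // { utf8: 4, utf16: 4, utf32: 4 }
```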

Note: at the abstract layer, characters and numbers are already in one-to-one correspondence, so encoding a character is in essence encoding its number.

4.2 Big-endian, little-endian, and the BOM

When multi-byte data is stored, its bytes can be laid out from low address to high address in either of two orders. Most modern computers are byte-addressed, that is, each address refers to 1 byte. Suppose the variable i is stored at address 0800H and consists of the bytes 01H, 23H, 45H, and 67H, each of which needs its own memory address. Which of the four bytes is stored at address 0800H? This is the byte order problem.

Multi-byte data is stored as a contiguous sequence of bytes. Depending on the order of the bytes within that sequence, there are two arrangements: big-endian and little-endian, as shown in the figure.

In big-endian mode, data is stored from the most significant byte to the least significant byte; that is, the most significant byte is stored first.

In little-endian mode, data is stored from the least significant byte to the most significant byte; that is, the least significant byte is stored first.
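Both byte orders can be observed with a DataView, whose setUint32 method takes an explicit little-endian flag. The value 0x01234567 reuses the four bytes from the example above; bytesOf is a hypothetical helper name.

```javascript
// Write a 32-bit value in the requested byte order and list the bytes
// in increasing address order.
function bytesOf(value, littleEndian) {
  const buffer = new ArrayBuffer(4);
  new DataView(buffer).setUint32(0, value, littleEndian);
  return Array.from(new Uint8Array(buffer), (b) => b.toString(16).padStart(2, '0')).join(' ');
}

console.log(bytesOf(0x01234567, false)); // 01 23 45 67 (big-endian)
console.log(bytesOf(0x01234567, true));  // 67 45 23 01 (little-endian)
```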

BOM

BOM = Byte Order Mark. It identifies which byte order (big-endian or little-endian) the data uses.

Bytes	Encoding Form
00 00 FE FF	UTF-32, big-endian
FF FE 00 00	UTF-32, little-endian
FE FF	UTF-16, big-endian
FF FE	UTF-16, little-endian
EF BB BF	UTF-8
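The table translates into a simple prefix check. The sketch below uses a hypothetical detectBom helper. It tests the 4-byte signatures first, because FF FE 00 00 also begins with the UTF-16 little-endian signature FF FE, so the two cannot always be told apart from the prefix alone.

```javascript
// Guess the encoding form from a leading byte order mark.
// `bytes` is an array or Uint8Array; returns null when no BOM is present.
function detectBom(bytes) {
  const startsWith = (...signature) => signature.every((b, i) => bytes[i] === b);
  if (startsWith(0x00, 0x00, 0xFE, 0xFF)) return 'UTF-32, big-endian';
  if (startsWith(0xFF, 0xFE, 0x00, 0x00)) return 'UTF-32, little-endian';
  if (startsWith(0xEF, 0xBB, 0xBF)) return 'UTF-8';
  if (startsWith(0xFE, 0xFF)) return 'UTF-16, big-endian';
  if (startsWith(0xFF, 0xFE)) return 'UTF-16, little-endian';
  return null;
}

console.log(detectBom([0xEF, 0xBB, 0xBF, 0x41]));       // UTF-8
console.log(detectBom([0xFF, 0xFE, 0x41, 0x00]));       // UTF-16, little-endian
console.log(detectBom([0x41, 0x42]));                   // null
```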

For the sake of illustration, the examples below use big-endian order.

4.3 UTF-32

For example, the code point U+0041 (binary: 1000001) of character A is encoded by UTF-32 and stored in the computer in the following form:

00000000 00000000 00000000 01000001

Advantages: efficient lookup; the n-th character is found in O(1) time
Disadvantages: wasted space

UTF-32 fills the upper three bytes with zeros here. The biggest drawback of this representation is the space it takes: no matter how small the code point is, it needs four bytes of storage. So how can this bottleneck be overcome? This is where the variable-length schemes come in.
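A minimal sketch of both properties, the fixed 4-byte layout and the O(1) lookup, is shown here in big-endian order. The function names are made up for this illustration.

```javascript
// Encode a string as big-endian UTF-32: 4 bytes per code point.
function encodeUtf32BE(str) {
  const codePoints = Array.from(str, (ch) => ch.codePointAt(0));
  const buffer = new ArrayBuffer(codePoints.length * 4);
  const view = new DataView(buffer);
  codePoints.forEach((cp, i) => view.setUint32(i * 4, cp, false));
  return new Uint8Array(buffer);
}

// O(1) lookup: the n-th code point always starts at byte offset n * 4.
function codePointAtIndex(bytes, n) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getUint32(n * 4, false);
}

console.log(Array.from(encodeUtf32BE('A')));                          // [ 0, 0, 0, 65 ]
console.log(codePointAtIndex(encodeUtf32BE('A😀'), 1).toString(16));  // 1f600
```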

4.4 UTF-8

UTF-8 is a variable-length encoding that uses 1 to 4 bytes to represent a symbol, with the number of bytes depending on the symbol. Reserved high-order bits distinguish the different lengths, as follows:

For single-byte symbols, the first bit of the byte is set to 0 and the remaining 7 bits hold the Unicode code point of the symbol. So for English letters, the UTF-8 encoding is the same as ASCII.
For n-byte symbols (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of each following byte are set to 10. All remaining bits hold the Unicode code point of the symbol.

Decoding UTF-8 is also simple. If the first bit of a byte is 0, that byte is a complete character on its own. If the first bit is 1, the number of consecutive leading 1s tells how many bytes the current character occupies.
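The decoding rule amounts to inspecting the leading bits of the first byte. A small sketch, with utf8SequenceLength as a hypothetical helper name:

```javascript
// Number of bytes in the UTF-8 sequence that starts with `firstByte`.
// Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
function utf8SequenceLength(firstByte) {
  if ((firstByte & 0b10000000) === 0b00000000) return 1; // 0xxxxxxx
  if ((firstByte & 0b11100000) === 0b11000000) return 2; // 110xxxxx
  if ((firstByte & 0b11110000) === 0b11100000) return 3; // 1110xxxx
  if ((firstByte & 0b11111000) === 0b11110000) return 4; // 11110xxx
  return 0;
}

console.log(utf8SequenceLength(0x41)); // 1
console.log(utf8SequenceLength(0xE4)); // 3
console.log(utf8SequenceLength(0xF0)); // 4
console.log(utf8SequenceLength(0x80)); // 0 (continuation byte)
```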

Conversion between code points and UTF-8:

Take the Chinese character “一” (yi, U+4E00) for example:
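The conversion can be sketched as a direct application of the bit layout described above. The code points used here, U+0041 (A), U+4E00 (一) and U+1F600, are illustrative choices, and utf8Encode is a hypothetical helper that assumes a valid, non-surrogate code point as input.

```javascript
// Encode one code point into its UTF-8 bytes.
// Assumes 0 <= cp <= 0x10FFFF and that cp is not a surrogate.
function utf8Encode(cp) {
  if (cp < 0x80) return [cp];                              // 0xxxxxxx
  if (cp < 0x800) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];         // 110xxxxx 10xxxxxx
  }
  if (cp < 0x10000) {                                      // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  return [                                                 // 11110xxx + three 10xxxxxx
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

const hex = (bytes) => bytes.map((b) => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');

console.log(hex(utf8Encode(0x0041)));  // 41
console.log(hex(utf8Encode(0x4E00)));  // E4 B8 80
console.log(hex(utf8Encode(0x1F600))); // F0 9F 98 80
```

For U+4E00 the binary value 0100 1110 0000 0000 is split as 0100, 111000, 000000 and placed into the template 1110xxxx 10xxxxxx 10xxxxxx, giving 11100100 10111000 10000000.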

4.5 UTF-16

UTF-16 is a variable-length encoding that uses 2 or 4 bytes. Characters in the BMP are encoded in 2 bytes, and all others are encoded as 4-byte surrogate pairs.

Conversion between code points and UTF-16:

For characters in the BMP, no conversion is required.

Characters in the supplementary planes (SP) are calculated as follows, where ÷ is integer division:

Lead = (code point - 0x10000) ÷ 0x400 + 0xD800
Trail = (code point - 0x10000) % 0x400 + 0xDC00

Take 😀 (U+1F600) for example:

Lead = (1F600 - 10000) ÷ 400 + D800 = F600 ÷ 400 + D800 = 3D + D800 = D83D
Trail = (1F600 - 10000) % 400 + DC00 = F600 % 400 + DC00 = 200 + DC00 = DE00

Therefore, the surrogate pair for 😀 (U+1F600) is D83D DE00.
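The same arithmetic as a short JavaScript sketch, including the reverse direction. The function names are made up for this example.

```javascript
// Split a supplementary-plane code point into its lead and trail surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('not a supplementary-plane code point');
  }
  const offset = codePoint - 0x10000;
  const lead = Math.floor(offset / 0x400) + 0xD800; // integer division
  const trail = (offset % 0x400) + 0xDC00;
  return [lead, trail];
}

// Recombine a lead and trail surrogate into the original code point.
function fromSurrogatePair(lead, trail) {
  return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
}

const [lead, trail] = toSurrogatePair(0x1F600);
console.log(lead.toString(16), trail.toString(16));        // d83d de00
console.log(String.fromCharCode(lead, trail) === '😀');    // true
console.log(fromSurrogatePair(lead, trail).toString(16));  // 1f600
```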

5. Bidirectional text

Bidirectional text is a string that contains both left-to-right and right-to-left characters.

Most scripts are written left-to-right, such as Latin (English letters) and Chinese characters, and a few are written right-to-left, such as Arabic (ar) and Hebrew (he). Typography and layout have to take the problems of bidirectional text into account.

5.1 Logical order and display order

The Unicode Standard specifies that text is stored in memory in logical order, while the display order is the order in which the text finally appears on screen. The two do not necessarily agree. For example, when right-to-left text is displayed, the display order runs from right to left, while the logical order may still run from left to right.

In all major web browsers, the order of characters in memory (logical) can differ from the order in which they are displayed (visual). Unicode defines a direction attribute for every character, and when rendering text browsers apply a rule that uses these attributes to determine the direction of the text automatically and produce the correct display order. This rule is described by the Unicode Bidirectional Algorithm, also known simply as the bidi algorithm.

5.2 Unicode character direction attributes

Unicode direction attributes fall into three types: strong, weak, and neutral. Each of these main types has many subcategories.

Strong characters have a definite direction that does not depend on the bidi attributes of their context, and in the bidi algorithm they may affect the neutral characters before and after them. Most characters are strong characters, such as Latin letters, Chinese characters, and Arabic letters.
Neutral characters have no direction of their own; their direction is determined by the bidi attributes of their context (the strong characters before and after them). Examples are most punctuation marks ("-", "[]", "()", etc.) and spaces.
Weak characters have a direction, but do not affect the bidi attributes of their context. Examples are digits and the symbols associated with numbers.

Direction	Related characters	Effect
Left-to-Right (LTR)	Strong left-to-right characters (English letters, Chinese characters, and most of the world's scripts written left -> right)	Direction is definite (LTR) and context-independent. May affect the direction of the characters before and after it.
Right-to-Left (RTL)	Strong right-to-left characters (Arabic, Hebrew, and other scripts written right -> left)	Direction is definite (RTL) and context-independent. May affect the direction of the characters before and after it.
Left-to-Right (LTR) / Right-to-Left (RTL)	Weak characters (digits and number-related symbols)	Like strong characters, the direction is definite, but it does not affect the direction of the preceding and following characters.
Neutral	Neutral characters (most punctuation and spaces)	Direction is indeterminate and is determined by context.
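JavaScript has no built-in way to read a character's bidi class, so the sketch below only approximates the three types with Unicode property escapes. It is a deliberate simplification and an assumption of this example, not the classification the bidi algorithm actually uses: it treats decimal digits as weak, letters as strong (right-to-left only for the Arabic and Hebrew scripts), and everything else as neutral.

```javascript
// Very rough approximation of strong / weak / neutral, for illustration only.
// The real bidi classes are finer-grained (for instance, some number-related
// symbols are weak, and other right-to-left scripts exist).
function roughBidiType(ch) {
  if (/\p{Nd}/u.test(ch)) return 'weak';
  if (/\p{L}/u.test(ch)) {
    return /[\p{Script=Arabic}\p{Script=Hebrew}]/u.test(ch) ? 'strong RTL' : 'strong LTR';
  }
  return 'neutral';
}

for (const ch of ['a', '中', 'א', '7', '(', ' ']) {
  console.log(JSON.stringify(ch), roughBidiType(ch));
}
```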

5.3 Direction

Global direction

The global direction is also called the base direction. It is the overall direction of a piece of text, and the order in which the text is displayed on the page depends on it. The global direction of a text is determined mainly by the following:

By default, it is inherited from the document (HTML), which is left -> right.
If a relevant dir attribute or direction style is present, the direction is set according to its value.

The browser sets the default base direction based on your default language, such as left-to-right for English and Chinese, and right-to-left for Arabic.

<p>tencent<bdo dir="rtl">Tencent Docs</bdo>The document</p>

Directional strings

A directional string is a run of consecutive characters with the same direction in a text that is not adjacent to another directional string with the same direction.

For example: after entering the number string (123)-456-789, type و button دت and the text automatically switches to right-to-left.

(123)-456-789و button دت then displays as in the image below. If you copy and paste it into an editor that does not support bidirectional text, it still appears as (123)-456-789و button دت.

The digits, being weak characters, keep their natural direction, while the neutral characters ()-+ follow the global direction given by the first strong character.

The text above is divided into 7 directional strings. Because the neutral symbols are affected by the global direction, the original number is split into separate directional strings and reordered.

For cases like this, the Unicode Standard defines a series of directional control characters that are not displayed on screen. For example, U+202E (RIGHT-TO-LEFT OVERRIDE) forces the text that follows it to run right -> left.

5.4 Control characters

The Unicode bidirectional algorithm can work out and correctly display bidirectional text from the character attributes and the global direction alone. In this implicit mode, the display of bidirectional text is handled by the algorithm essentially without human intervention. However, when the implicit algorithm is inadequate for bidirectional text in complex cases, it can be supplemented by an explicit mode. In explicit mode, in addition to the implicit algorithm, directional Unicode control characters are added to the bidirectional text to control how it is displayed. These control characters are not visible on screen and take up no display space; they only silently influence the display of the bidirectional text.
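Although these control characters are invisible, they are ordinary characters in the string and count toward its length. The sketch below wraps text in U+202E (RIGHT-TO-LEFT OVERRIDE) and U+202C (POP DIRECTIONAL FORMATTING) and then removes the bidi controls again; stripBidiControls is a hypothetical helper name.

```javascript
const RLO = '\u202E'; // RIGHT-TO-LEFT OVERRIDE: forces following text right -> left
const PDF = '\u202C'; // POP DIRECTIONAL FORMATTING: ends the override

const forced = RLO + 'abc' + PDF;
console.log(forced.length); // 5: three letters plus two invisible controls

// Remove the explicit bidi control characters:
// U+061C, U+200E, U+200F, U+202A to U+202E, U+2066 to U+2069.
function stripBidiControls(str) {
  return str.replace(/[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/g, '');
}

console.log(stripBidiControls(forced));        // abc
console.log(stripBidiControls(forced).length); // 3
```

Stripping such characters can be useful before comparing or displaying untrusted text, since an override changes what the reader sees without changing the visible letters.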

6. Related APIs

charAt()

Function: returns the character at a given index position. Specifically, this method finds the 16-bit code unit at the specified index and returns the character corresponding to that code unit:

console.log('String ABC'.charAt(1)); // t
console.log('😁'.charAt(1)); // �

charCodeAt()

Function: returns the code unit at the given index position. If the Unicode code point cannot be represented by a single UTF-16 code unit (because its value is greater than 0xFFFF), the returned code unit is the first code unit of the code point's surrogate pair. If you want the entire code point value, use codePointAt().

console.log('abc'.charCodeAt(0)); // 97
console.log('😁'.charCodeAt(0)); // 55357

console.log('abc'.codePointAt(0)); // 97
console.log('😁'.codePointAt(0)); // 128513

codePointAt()

Function: returns the code point at a given index position. If no UTF-16 surrogate pair starts at the index, the code unit at that index is returned directly. So if the index passed in is not the start of a surrogate pair (for example, it points at the trail surrogate), the result is not the intended code point.

console.log('abc'.codePointAt(0)); // 97
console.log('😁'.codePointAt(0)); // 128513
console.log('😁'.codePointAt(1)); // 56833

fromCharCode()

Function: creates a string from the given UTF-16 code units. This method accepts any number of numeric values and returns a string that concatenates the characters corresponding to all of them.

// The Unicode code point of "Latin Small Letter A" is U+0061; 0x0061 === 97

console.log(String.fromCharCode(97)); // a
console.log(String.fromCharCode(0x61)); // a

For characters from U+0000 to U+FFFF, length, charAt(), charCodeAt(), and fromCharCode() all return the expected results, because each such character is a single UTF-16 code unit.
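Outside that range the code-unit-based results stop matching what a reader would count as characters. As a sketch, string iteration (used by Array.from and for...of) walks by code point instead, and String.fromCodePoint() is the code point counterpart of fromCharCode(). The helper names below are made up for this example.

```javascript
// Count code points rather than UTF-16 code units.
function countCodePoints(str) {
  return Array.from(str).length;
}

// List the code points of a string.
function codePointsOf(str) {
  return Array.from(str, (ch) => ch.codePointAt(0));
}

const text = 'a😁c';
console.log(text.length);                   // 4 (the emoji takes two code units)
console.log(countCodePoints(text));         // 3
console.log(codePointsOf(text));            // [ 97, 128513, 99 ]
console.log(String.fromCodePoint(128513));  // 😁
```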

7. Typesetting and line breaking

Line-breaking rules

When typesetting text, laying out a line requires deciding where lines may be broken. For example, in Chinese, an opening quotation mark (“) must not be placed at the end of a line; in English, a word should be kept intact; and an emoji should likewise be kept intact.

Line-breaking algorithms:

Line-breaking algorithms are based on the box-glue-penalty model.

Box: an indivisible block
Glue: space that can be stretched or shrunk
Penalty: the cost of breaking a line at a given point

Typesetting optimality = minimal stretching and shrinking, and minimal line-breaking penalty (e.g., as few hyphens as possible)

First-fit algorithm

Place words and spaces one after another at their natural width until the next word would run past the end of the line, then start a new line

Disadvantages: spaces can only be stretched, not compressed

Advantages: O(n) complexity, simple implementation
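A minimal sketch of first-fit, under simplifying assumptions that are not part of the description above: every character is one unit wide, a space is one unit, and no hyphenation is attempted. The function name firstFit is made up for this example.

```javascript
// Greedy first-fit line breaking: keep adding words to the current line
// and start a new line as soon as the next word no longer fits.
function firstFit(words, lineWidth) {
  const lines = [];
  let current = [];
  let used = 0; // width of the current line, including single spaces
  for (const word of words) {
    const needed = current.length === 0 ? word.length : used + 1 + word.length;
    if (current.length > 0 && needed > lineWidth) {
      lines.push(current.join(' '));
      current = [word];
      used = word.length;
    } else {
      current.push(word);
      used = needed;
    }
  }
  if (current.length > 0) lines.push(current.join(' '));
  return lines;
}

console.log(firstFit(['aaa', 'bb', 'cc', 'ddddd'], 6));
// [ 'aaa bb', 'cc', 'ddddd' ]
```

Each word is examined once, which is where the linear complexity comes from. The second line ends up much shorter than the others because the algorithm never revisits an earlier decision.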

Best-fit algorithm

Set a maximum stretched width and a minimum compressed width for spaces, and use a greedy algorithm to make the layout of the current line optimal

Disadvantages: greedy, so the result is not the global optimum

Advantages: O(n) time complexity

Total-fit algorithm (Knuth-Plass)

An algorithm for optimal line breaking under the box-glue-penalty model (dynamic programming)

Advantages: the best typesetting result

Disadvantages: nonlinear complexity, so it is slow. The improvement is not obvious for CJK text. It does not apply to cases with varying line heights or text wrapping around objects.
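The dynamic-programming idea can be illustrated with a heavily simplified sketch. This is not the full Knuth-Plass algorithm: it has no stretchable glue, hyphenation, or penalties, and it assumes one unit per character and a cost equal to the squared unused width of each line except the last. The name totalFit is made up for this example.

```javascript
// Simplified total-fit: choose breaks that minimise the sum of squared
// leftover space over all lines (the last line costs nothing).
function totalFit(words, lineWidth) {
  if (words.some((w) => w.length > lineWidth)) {
    throw new RangeError('a word is wider than the line');
  }
  const n = words.length;
  const best = new Array(n + 1).fill(Infinity); // best[i]: min cost for words[i..]
  const next = new Array(n + 1).fill(n);        // next[i]: start of the following line
  best[n] = 0;
  for (let i = n - 1; i >= 0; i--) {
    let width = -1; // cancels the space counted before the first word
    for (let j = i; j < n; j++) {
      width += 1 + words[j].length;
      if (width > lineWidth) break;
      const slack = lineWidth - width;
      const cost = (j === n - 1 ? 0 : slack * slack) + best[j + 1];
      if (cost < best[i]) {
        best[i] = cost;
        next[i] = j + 1;
      }
    }
  }
  const lines = [];
  for (let i = 0; i < n; i = next[i]) {
    lines.push(words.slice(i, next[i]).join(' '));
  }
  return lines;
}

console.log(totalFit(['aaa', 'bb', 'cc', 'ddddd'], 6));
// [ 'aaa', 'bb cc', 'ddddd' ]
```

On this input a greedy pass would produce 'aaa bb' / 'cc' / 'ddddd', with a cost of 0 + 16. Looking at the whole paragraph finds 'aaa' / 'bb cc' / 'ddddd', with a cost of 9 + 1, at the price of a nested loop over candidate breaks.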