An overview of Unicode, UTF-8, and UTF-16

This article introduces Unicode and its two common encodings, UTF-8 and UTF-16, so that readers can learn how string encodings work and understand the relationship between Unicode, UTF-8, and UTF-16.

The main contents of this article are as follows:

  • Unicode encoding, including the basics of Unicode and its relationship to UTF-8 and UTF-16
  • UTF-8 encoding, including the basic concepts and how Unicode code points are converted to UTF-8
  • UTF-16 encoding, including the basic concepts and how Unicode code points are converted to UTF-16
  • String and DOMString in JavaScript

This article serves as basic background for a utfx.js source code analysis. By understanding the UTF-8 and UTF-16 encoding methods, readers will be able to follow how encoding conversions are implemented in JavaScript.

If you want to see how encoding conversions are used in practice, read my previous blog post in the WebSocket series on how JavaScript strings can be converted to and from binary data.

If you want to learn more about the utfx.js source code, check out my next article.

Unicode

Concept

Unicode (also translated as "unified code" or "universal code") is an industry standard in the field of computer science that covers a character set, encoding schemes, and more. Unicode was created to overcome the limitations of traditional character encoding schemes: it assigns a uniform and unique binary code to every character in every language, so that text can be converted and processed across languages and platforms. Development began in 1990, and the standard was officially announced in 1994.

Normally a Unicode code point is written with 2 bytes, for example U+A12B. Its 2-byte binary representation is 1010 0001 0010 1011, where A = 1010, 1 = 0001, 2 = 0010, and B = 1011.
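As a quick aside, we can check this binary form directly in JavaScript (a minimal sketch; `codePoint` is just a local variable for illustration):

```javascript
// Print the 16-bit binary representation of the code point U+A12B.
const codePoint = 0xA12B;
console.log(codePoint.toString(2).padStart(16, "0"));
// -> "1010000100101011", i.e. 1010 (A) 0001 (1) 0010 (2) 1011 (B)
```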

Having briefly introduced Unicode, let’s take a look at UTF-8 and UTF-16. Note that UTF stands for Unicode Transformation Format; both UTF-8 and UTF-16 are ways of encoding Unicode code points as program data.

UTF-8

Concept

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, created by Ken Thompson in 1992 and now standardized as RFC 3629. UTF-8 encodes a Unicode character in 1 to 4 bytes (the original design allowed up to 6). On a web page it allows Simplified and Traditional Chinese to be displayed together with other languages such as English, Japanese, and Korean.

UTF-8 is a very widely used variable-length character encoding.

First of all, what is a variable-length encoding? It means that the encoded length of a character is not fixed: in UTF-8, characters in the ASCII set take 1 byte, while most Chinese characters take 3 bytes.

Compared with Unicode’s uniform 2-byte representation, this saves a lot of storage when most characters fit in 1 byte. However, for characters that need more than 2 bytes, UTF-8 consumes more storage space.
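To make this concrete, here is a quick check using the standard TextEncoder API (available in modern browsers and Node.js), which always encodes to UTF-8:

```javascript
// Compare the UTF-8 byte lengths of an ASCII character and a Chinese character.
const encoder = new TextEncoder();
console.log(encoder.encode("A").length);  // 1 byte
console.log(encoder.encode("中").length); // 3 bytes
```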

Representation

As we can see from the introduction above, different Unicode code points occupy different amounts of storage in UTF-8. The table below shows how a Unicode code point is converted to UTF-8; in it, each ? marks a binary position to be filled with a bit of the Unicode value.

Unicode code range    UTF-8 encoding pattern
U+0000 ~ U+007F       0???????
U+0080 ~ U+07FF       110????? 10??????
U+0800 ~ U+FFFF       1110???? 10?????? 10??????
U+10000 ~ U+10FFFF    11110??? 10?????? 10?????? 10??????

When we have a Unicode code point, we first determine its range from the table above, then write the code point in binary, take the required number of bits from the least-significant end, and fill them into the ? positions from left to right to obtain the UTF-8 encoding. Here are two examples (a code sketch follows them):

  • U+0020 is not greater than U+007F, so only 1 byte is needed. The binary representation of U+0020 is 0000 0000 0010 0000, so we take the low 7 bits, 010 0000, and fill them into the 1-byte pattern, giving 0010 0000, or 20 in hexadecimal. The byte stored in memory is therefore 20.
  • U+A12B is greater than U+0800 and not greater than U+FFFF, so 3 bytes are required. The binary representation of U+A12B is 1010 0001 0010 1011 (the Unicode value itself), so we take all 16 bits and fill them into the 3-byte pattern, giving 11101010 10000100 10101011, or EA84AB in hexadecimal. The bytes stored in memory are therefore EA 84 AB.
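The conversion steps above can be written out as a short JavaScript sketch. This is a minimal illustration of the table, not utfx.js’s actual implementation; `codePointToUtf8` is a hypothetical helper name:

```javascript
// Convert a single Unicode code point to an array of UTF-8 bytes,
// following the four ranges in the table above.
function codePointToUtf8(cp) {
  if (cp <= 0x7F) {                  // 0???????
    return [cp];
  } else if (cp <= 0x7FF) {          // 110????? 10??????
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  } else if (cp <= 0xFFFF) {         // 1110???? 10?????? 10??????
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  } else {                           // 11110??? 10?????? 10?????? 10??????
    return [
      0xF0 | (cp >> 18),
      0x80 | ((cp >> 12) & 0x3F),
      0x80 | ((cp >> 6) & 0x3F),
      0x80 | (cp & 0x3F),
    ];
  }
}

console.log(codePointToUtf8(0x0020).map(b => b.toString(16))); // ["20"]
console.log(codePointToUtf8(0xA12B).map(b => b.toString(16))); // ["ea", "84", "ab"]
```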

From the examples above, I believe you now have a good understanding of UTF-8 encoding. Now, let’s look at another encoding, UTF-16.

UTF-16

Concept

UTF-16 is an implementation of the third layer of Unicode’s five-layer character encoding model, the Character Encoding Form (also known as the “storage format”): it maps the abstract code points of the Unicode character set to sequences of 16-bit code units (each 2 bytes long) for data storage or transmission. A Unicode code point requires one or two 16-bit code units, so UTF-16 is also a variable-length encoding.

From Wikipedia’s description of UTF-16, we can see that UTF-16 uses at least 2 bytes to represent a character, so it cannot be byte-compatible with ASCII encoding (which uses 1 byte per character).

Representation

In UTF-16, Unicode code points are divided into two ranges, each stored in a different way. See the following table for details.

Unicode range         UTF-16 encoding
U+0000 ~ U+FFFF       Stored in 2 bytes; the code unit equals the Unicode value
U+10000 ~ U+10FFFF    Stored in 4 bytes as a surrogate pair (see below)

For the second range, first subtract 0x10000 from the Unicode value to get a 20-bit number, then split it into its high 10 bits and its low 10 bits (each in the range 0 ~ 0x3FF). Adding 0xD800 to the high 10 bits gives the high surrogate (also called the lead surrogate, which stores the high bits); adding 0xDC00 to the low 10 bits gives the low surrogate (also called the trail surrogate, which stores the low bits).

With the conversions above, we can turn a Unicode code point into its UTF-16 encoding. Here are two examples (a code sketch follows them):

  • U+0020 falls in the first range, so after UTF-16 encoding the result is still 0020, and the order in memory is 00 20.
  • U+12345 falls in the second range, so we first subtract 0x10000 to get 0x02345, then split it into the high 10 bits 00 0000 1000 and the low 10 bits 11 0100 0101. After adding the values given by the rules above, the high surrogate is D808 and the low surrogate is DF45, so the final order in memory is D8 08 DF 45.
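As with UTF-8, the two ranges above translate directly into a short JavaScript sketch (`codePointToUtf16` is a hypothetical helper name, not part of utfx.js):

```javascript
// Convert a single Unicode code point to an array of UTF-16 code units.
function codePointToUtf16(cp) {
  if (cp <= 0xFFFF) {
    return [cp];                     // first range: stored as-is
  }
  const v = cp - 0x10000;            // 20-bit value
  const high = 0xD800 + (v >> 10);   // high (lead) surrogate
  const low = 0xDC00 + (v & 0x3FF);  // low (trail) surrogate
  return [high, low];
}

console.log(codePointToUtf16(0x0020).map(u => u.toString(16)));  // ["20"]
console.log(codePointToUtf16(0x12345).map(u => u.toString(16))); // ["d808", "df45"]
```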

String and DOMString in JavaScript

In JavaScript, all strings (called DOMString in the DOM specification) are encoded using UTF-16.

Therefore, when we need to convert strings to binary to communicate with the backend, we need to pay attention to the encoding.
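For example, the following sketch shows how a JavaScript string exposes its UTF-16 code units, and how it can be re-encoded to UTF-8 bytes with TextEncoder before being sent over the wire:

```javascript
// A string holding the single code point U+12345 consists of two
// UTF-16 code units (a surrogate pair).
const s = "\u{12345}";
console.log(s.length);                     // 2 (UTF-16 code units, not characters)
console.log(s.charCodeAt(0).toString(16)); // "d808" (high surrogate)
console.log(s.charCodeAt(1).toString(16)); // "df45" (low surrogate)

// Re-encode to UTF-8 before sending binary data to a backend.
console.log(new TextEncoder().encode(s));  // Uint8Array [240, 146, 141, 133]
```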

Conclusion

This article introduced Unicode encoding and the UTF-8 and UTF-16 encoding methods, so that you can understand Unicode and the two related program data encodings built on it.

This article is intended as basic background for utfx.js source code analysis. A follow-up article, utfx.js source code analysis, will show how these encoding conversions are implemented in JavaScript.