I have not updated my blog for nearly a year. This is my first post of 2020, and it is a consolidation of fundamentals: it explains the origin and inner workings of character encoding from several angles.

As you know, everything a program handles is ultimately stored in binary at the bottom of the computer. A char or an int that we define in code is converted into binary and stored; that process is called encoding. The reverse process, turning the computer's underlying binary back into meaningful characters on the screen (such as "Hello World"), is called decoding.

Character encoding and decoding in computers involve the concept of a character set, which is essentially a mapping table that associates each character with an integer, one to one. Common character sets include ASCII, Unicode, and so on.

We often conflate a character set with the encoding of that character set, but they are not the same concept: a character set is just a collection of characters, while encoding is a more involved process. Why these two concepts are so often mentioned together, how they relate, and what the ubiquitous UTF-8 actually is, are what this article discusses.

ASCII code

For a long time, computers were used only in a few developed countries, so programs only ever contained the Latin alphabet (a, b, c, d, and so on) and Arabic numerals that their users understood. The only encoding and decoding problem to solve was how to turn those characters into binary numbers the computer could understand, and the ASCII character set was born for exactly that. Encoding simply meant looking characters up in the ASCII table: whenever the program encountered the character 'a', it found the corresponding ASCII value 97 and stored it in binary form.

The following figure shows the ASCII character set mapping table, including control characters (carriage return, backspace, newline, etc.) and printable characters (upper- and lower-case letters, Arabic numerals, and Western punctuation).

This encoding is known as ASCII encoding. As the reference table shows, the ASCII character set contains 128 characters, all of which can be represented with only 7 bits, that is, the low 7 bits of a byte; the highest bit is always 0 (for example, 0110 0001 represents 'a').
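As a quick illustration, here is a minimal Python sketch using nothing but the built-in ord and format functions:

```python
# The ASCII value of 'a' and its binary form inside one byte.
code = ord("a")
print(code)                  # 97
print(format(code, "08b"))   # 01100001 -- high bit 0, low 7 bits carry the value
```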

Later, to represent more characters commonly used in European countries, such as the accented é in French, an extension of ASCII called EASCII was developed. It uses the full 8 bits of a byte and can represent 256 characters in total, including some derived Latin letters.

The limitations of ASCII

The ASCII character set is still in use today, but its biggest drawback is that it can only represent the basic Latin letters, Arabic numerals, and English punctuation, so it can only write modern American English (and even then, accents have to be dropped from words such as naïve, café, and élite). Although EASCII solved the display problem for some Western European languages, once computers reached Asia the languages of various countries still could not be represented.

In that era, every country extended the ASCII character set in its own way. The most representative example in China is the national GB encoding scheme for Chinese characters, which stipulates: a byte with a value of 127 or below keeps its original ASCII meaning, but two consecutive bytes with values above 127 together represent one simplified Chinese character. The leading (high) byte ranges from 0xA1 to 0xF7 and the trailing (low) byte from 0xA1 to 0xFE, which allows roughly 7,000 simplified Chinese characters to be combined.

To keep decoding uniform, the GB code tables also added mathematical symbols, Roman and Greek letters, Japanese kana, and so on. Even the digits, punctuation, and letters that already existed in ASCII were re-encoded as two-byte codes; these are the "full-width" characters we often speak of, while the original characters below 128 are called "half-width" characters. This set of rules became GB2312.

“A Chinese character counts as two English characters! One Chinese character counts as two English characters…”
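A small sketch of the two-byte rule, assuming Python's built-in gb2312 codec (the byte values in the comments are what that codec is expected to produce):

```python
# ASCII characters keep their single-byte form; Chinese characters take two
# bytes, both of which are greater than 0x7F.
print("A".encode("gb2312"))          # b'A'         -- 1 byte, plain ASCII
print("中".encode("gb2312"))         # b'\xd6\xd0'  -- 2 bytes, both above 0xA0
print(len("中文".encode("gb2312")))  # 4 -- "one Chinese character counts as two"
```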

The following figure shows a small part of the GB2312 character set. For details, see the GB2312 Simplified Chinese encoding table.

In this way we had our own Chinese character set, but there are a great many Chinese characters, and people soon found that the ones GB2312 could cover were not enough (a commonly cited example is the character 镕 in former Premier Zhu Rongji's name, which is not in GB2312). Experts therefore went on to assign the code points GB2312 had left unused to characters it did not include.

The rule became: any byte greater than 127 marks the beginning of a two-byte Chinese character, regardless of the value of the byte that follows. The resulting expanded scheme is known as the GBK standard; it includes everything in GB2312 while adding nearly 20,000 new characters (including traditional Chinese characters) and symbols.

At that time, other countries, like China, were developing their own encoding standards, and as soon as data had to be exchanged internationally the problem appeared: nobody could read anybody else's encodings. The same byte value 130 stands for é in the French code page, for Gimel (ג) in the Hebrew code page, and for yet another symbol in the Russian code page. The only thing all these encodings agree on is the range 0 to 127, because they are all ASCII-compatible, and that is still true today.
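To see this concretely, here is a small sketch that decodes the very same byte, 130, under three DOS-era code pages that Python ships with (cp437, cp862, and cp866, standing in for the Western European, Hebrew, and Russian encodings mentioned above):

```python
raw = bytes([130])  # one and the same byte value

print(raw.decode("cp437"))  # 'é' -- Western European DOS code page
print(raw.decode("cp862"))  # 'ג' -- Hebrew DOS code page (Gimel)
print(raw.decode("cp866"))  # 'В' -- Russian DOS code page

# Bytes 0-127 are interpreted the same everywhere, because all of these
# code pages are ASCII-compatible.
print(bytes([65]).decode("cp437"), bytes([65]).decode("cp866"))  # A A
```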

Unicode

As mentioned in the previous section, countries around the world had different encodings, and the same binary sequence can be decoded into different symbols. To open a text file correctly you must therefore know how it is encoded; read it with the wrong encoding and you get garbled characters. To solve this problem, the ultimate aggregator arrived: the Unicode character set. It incorporates all the symbols in the world, so that each number represents a unique symbol used in at least one language. More than 130,000 characters are now included in the Unicode character set (it passed 100,000 in 2005). It is worth noting that Unicode remains ASCII-compatible, meaning 0 to 127 keep their meanings.

Code points

Unicode is a character set, which is not the same thing as encodings such as UTF-8 or UTF-16. The number described in the previous section plays the same role as the ASCII value does in ASCII: it is the unique identifier of a character within the Unicode character set. In Unicode these numbers are called code points; for example, the code point U+0061, where 61 is the hexadecimal form of 97, represents the character 'a' in the Unicode character set.

A code point is written in the form U+[XX]XXXX, where each X is a hexadecimal digit; there are generally four to six digits. Values shorter than four digits are padded with leading zeros to four digits, while longer values use however many digits they need. According to the Unicode standard, the code point range is U+0000 ~ U+10FFFF and will not be expanded in the future; over a million positions are plenty, and only about 110,000 characters are currently assigned.
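In Python, for instance, ord and chr convert between characters and code points (a minimal sketch):

```python
cp = ord("a")
print(cp)                   # 97
print(f"U+{cp:04X}")        # U+0061 -- hexadecimal, zero-padded to four digits
print(chr(0x0061))          # 'a'

# A character outside the four-digit range simply uses more digits:
print(f"U+{ord('𝄞'):04X}")  # U+1D11E, the musical G clef
```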

In the whole coding process, code points serve as an intermediate transition layer, which can be represented by the following diagram:

As the diagram shows, the whole encoding process can be divided into two steps: first the characters in the program are mapped to specific numbers according to the character set, and then those numbers are stored in the computer in some specific way.

Clearly, the number itself is not what ends up in the computer. Our earlier understanding, that encoding simply means storing a character as a binary number, is therefore not quite accurate. Real encoding involves more detail: how many bytes each number occupies, and whether the representation is fixed-length or variable-length.

For example, the code point of the character 'a' is U+0061 (97 in decimal). How is U+0061 actually stored? The simplest representation is the 7-bit binary number 110 0001 used directly, but in a GB-style encoding scheme the same code point must be stored in two bytes as 0000 0000 0110 0001 (the empty positions are filled with zeros).

Unicode encodings

Three encoding schemes are derived from the Unicode character set: UTF-32, UTF-16, and UTF-8. This sets it apart from the earlier schemes, where the character set and the encoding corresponded one to one (as in ASCII and GBK), whereas Unicode has three encoding implementations. This is one of the reasons we need to distinguish character sets from encodings: "Unicode" by itself does not specifically mean UTF-8 or UTF-32.

Next, let's look at the following diagram to see how code points are translated into bytes under the different encodings:

The table above contains four code points and shows how each is encoded under UTF-32, UTF-16, and UTF-8. Among them: the conversion from code point to UTF-32 is the simplest, just pad with zeros at the front to fill 4 bytes; the conversion to UTF-8 keeps only the smallest code point numerically identical, while the other three bear no obvious resemblance to their code points; the conversion to UTF-16 is the most irregular: the first three characters are identical to their code points, but the larger one (more precisely, any code point above U+FFFF) looks completely different, becoming four bytes long with a very different value.

This brings in the two implementation strategies used in encoding: fixed length and variable length. UTF-32 is a fixed-length encoding: it always uses 4 bytes to store a code point. UTF-8 and UTF-16 are variable-length: UTF-8 uses 1 to 4 bytes depending on the code point, and UTF-16 uses 2 or 4 bytes.
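A quick way to observe this, assuming Python's built-in codecs (the "-be" variants are used so that no byte-order mark is prepended):

```python
for ch in ["a", "é", "丑", "𝄞"]:
    print(f"U+{ord(ch):04X}",
          "utf-8:",  len(ch.encode("utf-8")),
          "utf-16:", len(ch.encode("utf-16-be")),
          "utf-32:", len(ch.encode("utf-32-be")))

# Expected byte counts: UTF-32 is always 4; UTF-8 grows 1, 2, 3, 4;
# UTF-16 is 2 inside the BMP and 4 for U+1D11E.
```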

Fixed length versus variable length

Why do both fixed-length and variable-length encodings exist? Chinese has the classic problem of sentence segmentation: break a sentence in the wrong place and you convey the wrong meaning. Take this note from a fortune teller:

Great wealth and fortune no disaster beware

If the recipient breaks the sentence like this:

Great wealth and fortune, no disaster, just beware a little

it means he is destined for great wealth with no disaster at all, so he can do as he pleases. But before long the fellow passed away, and the fortune teller said helplessly: "You broke the sentence in the wrong place. It should have read:"

Great wealth and fortune, no; disasters, beware

That is, you will not be rich, and you should watch out for disasters whenever you go out.

This is why computers need fixed-length and variable-length rules when decoding: the computer's underlying binary stream is just as unpunctuated as the fortune teller's note, and we need a set of rules to read it correctly.

UTF-32

In UTF-32, the code point U+0041 (binary 100 0001) of the character 'A' is stored in the computer as follows:

00000000 00000000 00000000 01000001

UTF-32 fills all the high-order bits of the four bytes with zeros. The biggest drawback of this representation is that it wastes space: no matter how small the code point is, it takes four bytes to store. How can this bottleneck be broken? Variable-length schemes emerged.
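Encoding with Python's utf-32-be codec shows the zero padding directly (a minimal sketch):

```python
print("A".encode("utf-32-be").hex())   # 00000041 -- four bytes even for ASCII
print("丑".encode("utf-32-be").hex())  # 00004e11
```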

UTF-8

UTF-8 is a variable-length encoding. A character may occupy 1, 2, 3, or 4 bytes, and the high-order bits of each byte are reserved to distinguish the different lengths. Its rules are as follows:

  1. For single-byte symbols, the highest bit of the byte is set to 0 and the remaining seven bits are the symbol's Unicode code point. For English letters, UTF-8 is therefore identical to ASCII.

  2. For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, the (n + 1)-th bit is set to 0, and the first two bits of each following byte are set to 10. All remaining bits are filled with the symbol's Unicode code point, as shown in the following table:

Unicode code point range (hexadecimal)   UTF-8 encoding (binary)                   Number of bytes
0000 0000 ~ 0000 007F                    0xxxxxxx                                  1 byte
0000 0080 ~ 0000 07FF                    110xxxxx 10xxxxxx                         2 bytes
0000 0800 ~ 0000 FFFF                    1110xxxx 10xxxxxx 10xxxxxx                3 bytes
0001 0000 ~ 0010 FFFF                    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx       4 bytes

For example, the code point of the Chinese character 丑 ("ugly") is 0x4E11 (0100 1110 0001 0001), which falls in the range 0000 0800 ~ 0000 FFFF on the third row of the table above, so 丑 must be encoded as three bytes:

The three leading 1s of the first byte mean the character occupies three bytes. The sixteen empty x positions are filled starting from the last bit of 丑's code point and working backwards, and any positions left over are filled with 0. The result is that the UTF-8 encoding of 丑 is 11100100 10111000 10010001, which is E4B891 in hexadecimal.

Decoding UTF-8 is just as easy. If a byte's first bit is 0, the byte is a single character on its own; if the first bit is 1, the number of consecutive leading 1s tells how many bytes the current character occupies. 丑 starts with three 1s, so it spans three bytes, and the decoder then extracts the significant bits.
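The following sketch encodes a single code point by hand according to the table above and checks the result against Python's built-in codec; the helper name encode_utf8 is mine, not a standard API:

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point following the UTF-8 bit patterns above."""
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  (code points above U+FFFF)
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(encode_utf8(0x4E11).hex())   # e4b891 -- the character 丑
print("丑".encode("utf-8").hex())  # e4b891, same result
```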

UTF-16

UTF-16 uses a variable-length encoding scheme of either 2 or 4 bytes.

Unicode 1.0 was originally designed as a pure 16-bit code with 65,536 code points (U+0000 ~ U+FFFF), in the hope of representing all modern characters, but over time it became clear that 16 bits were not enough. The result is today's UTF-16, in which a character may take 4 bytes. Unicode now has 1,114,112 code points (U+0000 ~ U+10FFFF), which is the range described earlier.

Code points in the range U+0000 ~ U+FFFF form the BMP (Basic Multilingual Plane); the extended range U+10000 ~ U+10FFFF forms the SP (Supplementary Planes). UTF-16 encodes supplementary characters using surrogates taken from the BMP.

What is a surrogate?

Surrogates serve the same purpose as the length-marking prefixes in UTF-8: they make variable-length encoding possible.

What is the surrogate area?

The surrogate area consists of two special ranges of Unicode code points (an otherwise unassigned part of the BMP), 2,048 positions in total, split into a high surrogate area (D800 ~ DBFF) and a low surrogate area (DC00 ~ DFFF) of 1,024 each. The two areas form a two-dimensional table with 1,024 x 1,024 = 1,048,576 = 16 x 65,536 cells, which is exactly enough to represent every character in the 16 supplementary planes.

By moving from one dimension to two, the available space grows, and UTF-16 gains a way to reach the extra code points.

A code unit from the high surrogate area (the lead surrogate, a row in the table above) combined with one from the low surrogate area (the trail surrogate, a column) forms a surrogate pair. Some example conversions can be seen in the figure, such as

(D800, DC00) -> U+10000, the top-left corner, the first supplementary character

(DBFF, DFFF) -> U+10FFFF, the bottom-right corner, the last supplementary character

What is the algorithm for converting a code point to its UTF-16 encoding?

It splits into two cases:

  1. Code points in the BMP correspond directly; no conversion is needed;

  2. Code points in the supplementary planes (SP) require a calculation. As the table above suggests, code points are laid out top to bottom and left to right, so a simple division yields a quotient and a remainder that identify the row and the column.

    Take the code point, subtract 0x10000, and divide by 0x400 (= 1024): the quotient gives the row, the remainder gives the column; add each area's starting value (0xD800 for the row, 0xDC00 for the column) and you have the surrogate pair. A sketch of this arithmetic in code follows below.
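Here is a minimal Python sketch of that arithmetic; the function names are mine, chosen only for illustration:

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point (U+10000 ~ U+10FFFF) into a surrogate pair."""
    offset = cp - 0x10000              # 20 bits remain after removing the offset
    high = 0xD800 + (offset >> 10)     # quotient of dividing by 0x400 -> row (lead surrogate)
    low = 0xDC00 + (offset & 0x3FF)    # remainder -> column (trail surrogate)
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(u) for u in to_surrogate_pair(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
print(hex(from_surrogate_pair(0xD834, 0xDD1E)))       # 0x1d11e
```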

Note that the most commonly used characters are still encoded within the BMP. The following table describes the UTF-16 encoding used for each code point range:

Unicode code point range   UTF-16 encoding
U+0000 ~ U+D7FF            BMP; the code point is used directly as one 16-bit code unit (2 bytes)
U+D800 ~ U+DFFF            The surrogate area inside the BMP; not assigned to any character
U+E000 ~ U+FFFF            BMP; the code point is used directly as one 16-bit code unit (2 bytes)
U+10000 ~ U+10FFFF         A surrogate pair: high surrogate + low surrogate (two code units, 4 bytes)

Here is a concrete example, computing the surrogate pair for code point U+1D11E:
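Applying the algorithm from the previous section step by step:

0x1D11E - 0x10000 = 0xD11E
0xD11E / 0x400 = 0x34, remainder 0x11E
high surrogate = 0xD800 + 0x34 = 0xD834
low surrogate = 0xDC00 + 0x11E = 0xDD1E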

Therefore, the surrogate pair corresponding to code point U+1D11E is D834 DD1E. The following table lists the UTF-16 encoding of some other characters:

Character   Code point   Code point (binary)         UTF-16 code units (binary)                  UTF-16 bytes (hex)
$           U+0024       0000 0000 0010 0100         0000 0000 0010 0100                         0024
€           U+20AC       0010 0000 1010 1100         0010 0000 1010 1100                         20AC
𐐷           U+10437      0001 0000 0100 0011 0111    1101 1000 0000 0001 1101 1100 0011 0111     D801 DC37
𤭢           U+24B62      0010 0100 1011 0110 0010    1101 1000 0101 0010 1101 1111 0110 0010     D852 DF62

As with UTF-8, UTF-16 also has a mapping from code point bits to actual code units for each code point range, as follows:

Code point range                    Code point (binary)       UTF-16 encoding (code units)
U+0000 ~ U+D7FF, U+E000 ~ U+FFFF    xxxxxxxxxxxxxxxx          xxxxxxxxxxxxxxxx
U+10000 ~ U+10FFFF                  uuuuuxxxxxxyyyyyyyyyy     110110wwwwxxxxxx 110111yyyyyyyyyy

Code points in the range U+10000 ~ U+10FFFF are encoded by setting the high bits of the first and third bytes to 110110 and 110111 respectively, and then filling the remaining positions with the code point's binary digits (where wwww = uuuuu - 1, which is how the 0x10000 offset gets subtracted).

This is the end of this post, and if you have any questions, please contact me directly or discuss them in the comments section of the blog.

References and further reading

Programming with Unicode

Mapping codepoints to Unicode encoding forms

Character sets and encodings (iii) – Fixed and variable length

Character sets and Encodings (iv) — Unicode

wiki-UTF-16

General questions, relating to UTF or Encoding Form

Detail and origin of character encodings (UNICODE,UTF-8,GBK)

Character encoding notes: ASCII, Unicode and UTF-8

Unicode Character Set and UTF-8, UTF-16, UTF-32 Encoding

Unicode website

Unicode Chinese Character coding table