This article was first published on the WeChat public account Programmer Georgi.

Georg: First of all, what are Unicode and code points?

What is Unicode?

The screenshot below is from www.unicode.org/standard/Wh…

Unicode defines a numeric representation for almost every character in the world (that is, the characters you see, such as A, B, C, Chinese characters, and so on), and it is compatible with many older encoding standards, such as the familiar ASCII.

What is a code point?

Just as each person in our country has a unique ID number, Unicode issues an ID card to each character, identifying it by a unique number.

This number is unique throughout the computing world, and Unicode gives it a name: the code point.

How are code points represented?

Code points are written like this:

U+XXXXXX is the notation for a code point, where each X is a hexadecimal digit and there can be 4 to 6 of them. A value with fewer than 4 digits is padded with leading zeros to 4 digits; a value with more than 4 digits is written with as many digits as it needs.

For example, the code point of the character A is U+0041 (4×16 + 1 = 65, the same value as its ASCII code), and the code point of the Chinese character for "you" (你) is U+4F60.
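The notation can be reproduced in a few lines of Python. This is only a small sketch: the helper name `code_point_notation` is made up for illustration, and it relies on the built-in `ord()` returning a character's code point as an integer.

```python
def code_point_notation(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    # ord() gives the code point as an integer; ":04X" pads with
    # leading zeros to at least 4 uppercase hexadecimal digits.
    return f"U+{ord(ch):04X}"

print(code_point_notation("A"))           # U+0041
print(code_point_notation("你"))          # U+4F60
print(code_point_notation(chr(0x20C30)))  # U+20C30 (5 digits, no padding)
```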

This website is a great tool for looking up code points:

www.fileformat.info/info/unicod…

Type a character into the search box (marked 1 in the screenshot); the result (marked 2) is the Unicode code point of that character. You can also click result 2 to see more details.

Let me click result 2 to show you:

The page is at www.fileformat.info/info/unicod…

It shows very detailed information about the character.

Georg: Let's say I replace the code point in this URL with DC00 and see what happens:

http://www.fileformat.info/info/unicode/char/dc00/index.htm

You can see that no character is assigned to it. Instead, the page describes it as a surrogate: U+DC00 is listed as "Low Surrogate, First", just as U+D800 is listed as "Non Private Use High Surrogate, First". These code points belong to the surrogate area, which has to do with Unicode's three encoding forms (that is, how code points are converted to UTF-8, UTF-16 or UTF-32): UTF-16 is the one that uses the concept of a surrogate area.

Value range of code points

Code points currently range from U+0000 to U+10FFFF, so the theoretical number of code points is 0x10FFFF + 1 = 0x110000.

Reading 0x110000 in hexadecimal: the lower 1 stands for 0x10000 = 65,536 (16 to the fourth power), and the higher 1 is worth 16 times as much, so in total there are 1×16 + 1 = 17 blocks of 65,536. A rough estimate is 17 × 60,000 = 1.02 million (exactly 1,114,112), so this is a number in the millions.

To classify and manage so many code points, every 65,536 code points are treated as one plane, giving 17 planes in total.
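The arithmetic is easy to check in Python. The sketch below is illustrative only; the names `MAX_CODE_POINT`, `PLANE_SIZE` and `plane_of` are invented for this example. Because a plane holds exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits.

```python
MAX_CODE_POINT = 0x10FFFF
PLANE_SIZE = 0x10000  # 65,536 code points per plane

total_code_points = MAX_CODE_POINT + 1
plane_count = total_code_points // PLANE_SIZE

def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that a code point belongs to."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code point range")
    return code_point >> 16  # same as code_point // 0x10000

print(total_code_points)   # 1114112
print(plane_count)         # 17
print(plane_of(0x4F60))    # 0
print(plane_of(0x20C30))   # 2
print(plane_of(0x10FFFF))  # 16
```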

The surrogate area mentioned above lives inside one of these planes, and planes have some details of their own. To help you understand the surrogate area, let's talk about planes first.

Planes, BMP, SP

What is a plane?

As described above, the whole code point range can be divided into 17 parts of 65,536 code points each, and each part is a plane. Numbering starts at 0, so the first plane is called Plane 0.

The image below is from rishida.net/docs/unicod…

What is BMP?

The first plane is the Basic Multilingual Plane (BMP), also called Plane 0, whose code point range is U+0000 to U+FFFF. This is the plane we use most: most of the characters used in daily life fall in it.

The first colorful plane in the image above is the BMP.

UTF-16 needs only two bytes to encode a character in this plane.

Even the BMP, the most commonly used plane, has room for more than 60,000 code points. What would it look like if all of these characters were drawn on one picture? GNU Unifont made exactly such an image; see unifoundry.com/pub/unifont…

Here is a thumbnail of it:

What are supplementary planes?

The following 16 planes are called Supplementary Planes (SP). Their code points are all above U+FFFF, so they exceed the upper limit of a 16-bit space, and UTF-16 encodes characters in these planes with four bytes.

The surrogate area

You may have noticed a blank patch in the BMP thumbnail above. What is this white glare that dazzles our programmer eyes? This is the so-called surrogate area.

You can see that this patch runs from D8 to DF. The red part in front, D800-DBFF, is the high surrogate area, and the blue part behind it, DC00-DFFF, is the low surrogate area. Each of them has 4×256 = 1024 code points.

Looking back at the earlier lookup: D800, labeled "Non Private Use High Surrogate, First", is the first code point of the high surrogate area D800-DBFF, while DC00 is the first code point of the low surrogate area DC00-DFFF.

How does UTF-16 encode with the surrogate area?

UTF-16 is a variable-length encoding that uses 2 or 4 bytes. Characters in the BMP are encoded with 2 bytes, and all others are encoded with 4 bytes as so-called surrogate pairs.

In the bird's-eye view above we saw the blank surrogate area: 2048 positions reserved so that UTF-16 can encode characters in the supplementary planes. They are split into the high surrogate area (D800-DBFF) and the low surrogate area (DC00-DFFF), with 1024 positions each. The two areas form a two-dimensional table with 1024×1024 = 2^10×2^10 = 2^4×2^16 = 16×65536 cells, which is exactly enough to represent every character in the 16 supplementary planes.

The following image is from wiki

What is a surrogate pair?

A code unit from the high surrogate area (Lead, the rows in the figure above) followed by a code unit from the low surrogate area (Trail, the columns in the figure above) forms a surrogate pair. Two highs in a row, two lows in a row, or a low followed by a high are all illegal sequences.

The figure shows some example conversions, such as:

(D8 00 DC 00) -> U+10000, top left, the first supplementary character

(DB FF DF FF) -> U+10FFFF, bottom right, the last supplementary character
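The two corner cases can be checked with a short Python sketch. The function names here are hypothetical; the formula simply treats the high surrogate as the row index and the low surrogate as the column index of the 1024×1024 table, offset by 0x10000.

```python
def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF

def combine_surrogates(high: int, low: int) -> int:
    """Map a (high, low) surrogate pair to its supplementary code point."""
    # Only "high followed by low" is legal; every other order is rejected.
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError("not a valid high+low surrogate pair")
    row = high - 0xD800  # 0..1023
    col = low - 0xDC00   # 0..1023
    return 0x10000 + (row << 10) + col

print(hex(combine_surrogates(0xD800, 0xDC00)))  # 0x10000
print(hex(combine_surrogates(0xDBFF, 0xDFFF)))  # 0x10ffff
```

Passing a low surrogate first, for example `combine_surrogates(0xDC00, 0xD800)`, raises `ValueError` in this sketch.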

So why does UTF-16 use surrogate pairs?

At the beginning, a fixed-length two-byte scheme was adopted, but it could not keep up with growth: two bytes give only 2^16 = 65536 values, and there are more than 65536 characters to encode once all Chinese characters are counted. What then? Expand, of course!

Switching to 4 bytes would solve the capacity problem, but it would create an efficiency problem. For example, the character A fits in 1 byte; if it had to be stored in 4 bytes, a file that used to take 1 GB might now take 4 GB, and storage costs money.

So what can be done? Various experts each created their own encoding scheme in an attempt to balance efficiency and capacity, and one of them came up with the UTF-16 encoding scheme!

Look at the figure below: the codes are not assigned continuously, and codes 70-89 have no corresponding characters.

Here the codes 70 to 89 are carved out and used as the horizontal and vertical axes of a 10×10 table, which adds another 100 code positions. We give up 20 of our original 100 codes, because 70 to 89 are 20 codes that no longer stand for characters themselves.

But by forming surrogate pairs, those 20 codes provide another 100 code positions, a net gain of 80. This is the lengthening trick that UTF-16 uses.

Xiao Meng: So UTF-16 sacrifices the space of the high surrogate area (D800-DBFF) and the low surrogate area (DC00-DFFF), but gains 1024×1024 = 16×65536 positions in return. That is how it achieves the expansion!
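The bookkeeping of the toy example can be counted out in Python. This is only a sketch: splitting the reserved codes into 70-79 as the "high" half and 80-89 as the "low" half is an assumption made here by analogy with UTF-16, since the original figure is not available.

```python
# Assumed split of the 20 reserved codes (not stated in the text).
LEAD = range(70, 80)   # plays the role of the high surrogate area
TRAIL = range(80, 90)  # plays the role of the low surrogate area

# Codes that still stand for a character on their own.
direct_codes = [c for c in range(100) if c not in LEAD and c not in TRAIL]
# Every (lead, trail) combination is one extra code position.
pair_codes = [(hi, lo) for hi in LEAD for lo in TRAIL]

print(len(direct_codes))                    # 80
print(len(pair_codes))                      # 100
print(len(direct_codes) + len(pair_codes))  # 180

# The same bookkeeping for UTF-16: give up 2048 codes, gain 1024 * 1024.
utf16_capacity = 0x10000 - 2048 + 1024 * 1024
print(utf16_capacity)                       # 1112064
```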

How to convert code points to UTF-16?

Georg: Let's continue from the previous example. The conversion has two cases:

1. In the BMP, that is if U < 0x10000, the UTF-16 encoding of U is simply U as a 16-bit unsigned integer.

2. In a supplementary plane (SP), that is if U >= 0x10000, a calculation is needed.

We compute U' = U - 0x10000 and write U' as a 20-bit binary number: YYYY YYYY YYXX XXXX XXXX. The UTF-16 encoding of U (in binary) is then: 110110YYYYYYYYYY 110111XXXXXXXXXX.

For example, take the code point 0x20C30. Subtracting 0x10000 gives 0x10C30, which in binary is 0001 0000 1100 0011 0000. Replacing the Y's in the template with the first 10 bits and the X's with the last 10 bits gives 1101100001000011 1101110000110000, which is 0xD843 0xDC30 in hexadecimal.

Note: the calculation above only illustrates the principle of the conversion; it is not necessarily how implementations actually compute it.
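The two cases can be sketched in Python with bit operations. The function name `to_utf16_units` is made up for this illustration, and rejecting surrogate code points is an extra check added here; the last line cross-checks the result against Python's built-in `utf-16-be` codec.

```python
def to_utf16_units(code_point: int) -> list[int]:
    """Encode one code point as a list of 16-bit UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        return [code_point]       # case 1: BMP, the value itself
    u = code_point - 0x10000      # case 2: the 20-bit value U'
    high = 0xD800 | (u >> 10)     # 110110 followed by the 10 Y bits
    low = 0xDC00 | (u & 0x3FF)    # 110111 followed by the 10 X bits
    return [high, low]

print([f"{x:04X}" for x in to_utf16_units(0x4F60)])   # ['4F60']
print([f"{x:04X}" for x in to_utf16_units(0x20C30)])  # ['D843', 'DC30']
print(chr(0x20C30).encode("utf-16-be").hex())         # d843dc30
```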

UTF-32

The largest code point, 10FFFF, needs only 21 bits, whereas UTF-32 uses a fixed length of four bytes, that is 32 bits. That is more than enough to represent every code point, so the encoding is simply the code point value padded with leading zeros. The biggest drawback of this representation is that it takes up too much space.
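A minimal Python sketch of this idea, assuming big-endian byte order (the text does not say which byte order is meant) and using a hypothetical helper name `to_utf32_be`:

```python
def to_utf32_be(code_point: int) -> bytes:
    """Encode one code point as 4 big-endian bytes, zero-padded on the left."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    return code_point.to_bytes(4, "big")

print(to_utf32_be(0x4F60).hex())    # 00004f60
print(to_utf32_be(0x10FFFF).hex())  # 0010ffff

# The space cost: even a plain "A" takes 4 bytes instead of 1.
print(len("A".encode("utf-32-be")), len("A".encode("utf-8")))  # 4 1
```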

Now let's look at UTF-8, which is a little more complicated.

UTF-8

The benefits of UTF-8

Xiao Meng: My scheme is simple, but at least it is an encoding, as shown in the table below.

Coding scheme 1    Character
0                  h
1                  e
2                  l
3                  a
4                  v
5                  z
6                  y
7                  i
...                ...

Your scheme is a nice idea: the codes simply grow along with the numbers. Encoding works, but decoding is difficult.

As you can see, your scheme does not work because the single-digit codes are all used up, so there is no way to tell a single-digit code from a multi-digit one. For example, 12 could be the two codes 1 and 2 or the single code 12.

Coding scheme 2 in the figure below is my improved scheme.

Since the previous scheme is ambiguous, in my second scheme I leave room in the low range: the digits 5 and above (5, 6, 7, ...) are not used as codes on their own, and neither is anything up to 49; the codes jump straight to 50. Then I introduce a variable-length decoding rule:

Scan from left to right. A digit below 5 is decoded on its own as a single-digit code; on reading a digit of 5 or more, the current digit and the next digit are decoded together as a two-digit code.

Let's look at an example.

0 and 1 are both below 5 (digits below 5 are decoded as single digits), so they decode to "he". Then comes 5 (on reading a digit of 5 or more, the current digit and the next digit are read and decoded together), so 5 and 3 together make 53, and the code table says 53 is "you". This scheme avoids ambiguity.
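The decoding rule of scheme 2 can be sketched in Python as follows. The code table here is partly hypothetical: 0 -> "h" and 1 -> "e" follow the scheme 1 table and 53 -> "you" follows the example above, while the rest of the table is not known and is left out.

```python
# Partial, illustrative code table (see the assumptions above).
TABLE = {"0": "h", "1": "e", "53": "you"}

def decode(digits: str) -> list[str]:
    """Decode a digit string using the variable-length rule of scheme 2."""
    out = []
    i = 0
    while i < len(digits):
        if digits[i] < "5":
            code = digits[i]        # below 5: a single-digit code
            i += 1
        else:
            code = digits[i:i + 2]  # 5 or more: read two digits together
            if len(code) < 2:
                raise ValueError("input ends in the middle of a code")
            i += 2
        out.append(TABLE[code])
    return out

print(decode("0153"))  # ['h', 'e', 'you']
```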

Georg: This is still a rather crude design. If we search this string for the character "O", whose code is 3, we will find both 3 and 53, and the 3 inside 53 will be matched by mistake. Such a design makes it hard to implement a matching algorithm.

In fact, the key is to reserve the high bits to tell the cases apart; the drawback is that less space is left for actual codes.

UTF-8 is a variable-length encoding scheme in which a character takes 1, 2, 3 or 4 bytes. UTF-8 reserves the high bits of each byte to distinguish the different lengths, as follows:

As you can see, a multi-byte pattern never contains a one-byte pattern, because the highest bit differs. In UTF-8 a two-byte pattern is not contained in a three-byte pattern, nor in a four-byte pattern, and a three-byte pattern is not contained in a four-byte pattern. This solves the search-matching problem described above.

Because the 0s and 1s in the fixed bit positions differ, a two-byte sequence is never identical to either the first two or the last two bytes of a three-byte sequence.

So when searching, two-byte and three-byte codes never overlap, because their high bits differ. The price is a smaller effective coding space.
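The role of the reserved high bits can be illustrated with a small Python sketch. The function `byte_kind` and its labels are invented for this example, and it looks only at the bit pattern of each byte; it does not perform full UTF-8 validation.

```python
def byte_kind(b: int) -> str:
    """Classify one UTF-8 byte by its reserved high bits."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "single"        # 0xxxxxxx
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx
    return "invalid"

data = "a汉".encode("utf-8")  # bytes 61 E6 B1 89
print([byte_kind(b) for b in data])
# ['single', 'lead-3', 'continuation', 'continuation']
```

Since the first byte of a character never looks like a continuation byte, a byte-level search for one character's encoding cannot match in the middle of another character.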

How are code points converted to UTF-8?

Unicode code point (hexadecimal)    UTF-8 byte stream (binary)
000000-00007F                       0xxxxxxx
000080-0007FF                       110xxxxx 10xxxxxx
000800-00FFFF                       1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF                       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

To encode a code point, first determine which range it falls in, then use the corresponding byte template.

For characters in the range 0x00-0x7F, the UTF-8 encoding is identical to the ASCII encoding.

For example, the code point of "han" (汉) is 0x6C49. 0x6C49 lies between 0x0800 and 0xFFFF, so it uses the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx. Writing 0x6C49 in binary gives 0110 1100 0100 1001; filling the x's in the template with these bits in order gives 11100110 10110001 10001001, that is, E6 B1 89.
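The template filling can be written out in Python. This is a sketch for illustration, not a replacement for the built-in codec: the name `to_utf8` is hypothetical, and rejecting surrogate code points is an extra check added here.

```python
def to_utf8(cp: int) -> bytes:
    """Encode one code point by filling the x bits of the matching template."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(to_utf8(0x6C49).hex(" ").upper())         # E6 B1 89
print(to_utf8(0x6C49) == "汉".encode("utf-8"))  # True
```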
