Unicode

To solve the problem of every language having its own independent encoding, the Unicode scheme was proposed. Its approach is simple and blunt: put all the characters of all the world's languages into one unified code table. Its two character sets, UCS-2 and UCS-4, provide up to 2^16 and 2^32 code points respectively: that should be enough until aliens visit Earth.

Let’s look at Unicode code points for a few characters:

ls = 'abAB巩★☆'
print([ord(l) for l in ls])

Results: [97, 98, 65, 66, 24041, 9733, 9734]. The code points of the letters a, b, A, B are identical to their ASCII codes, so Unicode is compatible with ASCII. The Chinese character 巩, however, is 24041 (0x5DE9), which differs from its GB-series code 47534 (0xB9AE): Unicode and the GB-series encodings are not compatible with each other; only the ASCII portion is shared.
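We can verify the GB-series value with Python's standard gbk codec (a quick check, assuming a standard CPython build, which ships gbk):

>>> hex(ord('巩'))            # the Unicode code point
'0x5de9'
>>> '巩'.encode('gbk').hex()  # the GB-series bytes: a different number entirely
'b9ae'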

Once everyone adopts Unicode, the problems of code extension and mojibake disappear: every human-language character has a single code point, and every number we write down in communication corresponds to exactly one character. Python's chr() function returns the character for a given Unicode code point.

>>> print([chr(i) for i in [123, 957, 24041]])
['{', 'ν', '巩']

So can we encode text directly with this all-powerful 'Unicode'?

>>> ls = 'abAB巩★☆'
>>> ls.encode('Unicode')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: Unicode

Unknown encoding: Unicode! That is because there is no such thing as a 'Unicode' encoding. Unicode is just a code point table: it establishes a mapping between characters and integers. It does not specify how a code point is laid out in bytes, whether the high or the low byte comes first, or whether special markers are added. Those details are handled by concrete encodings: UTF-32, UTF-16, and UTF-8.
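As a quick aside, that code point table can be consulted directly; Python's standard unicodedata module reports each code point's official name:

>>> import unicodedata
>>> unicodedata.name('A'), unicodedata.name('巩')
('LATIN CAPITAL LETTER A', 'CJK UNIFIED IDEOGRAPH-5DE9')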

UTF-32: four bytes per character

UTF-32, as the name implies, stores every character in 32 bits, i.e. four bytes.

>>> 'aA巩'.encode('utf-32le')
b'a\x00\x00\x00A\x00\x00\x00\xe9\x5d\x00\x00'

As you can see, every character occupies four bytes: the bytes not used by the code point itself are padded with \x00. The method could not be more direct: the Unicode code point needs no conversion at all and is simply written out. But all those \x00 bytes waste a great deal of space.
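In fact we can reproduce the UTF-32LE bytes by hand, writing each code point as a four-byte little-endian integer (a quick sketch using int.to_bytes):

>>> b''.join(ord(c).to_bytes(4, 'little') for c in 'aA巩') == 'aA巩'.encode('utf-32le')
True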

Is there a cure for this waste? Could each character be squeezed into two bytes?

UTF-16: two-byte units

Let's encode the same string in UTF-16:

>>> 'aA巩'.encode('utf-16le')
b'a\x00A\x00\xe9\x5d'

Two bytes are enough for most Unicode code points; for code points above 0xFFFF, which do not fit, UTF-16 automatically falls back to four bytes (a surrogate pair). The codec handles this for us, so we rarely need to think about it, and UTF-16 byte sequences still correspond to characters one by one.
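For example, an emoji such as U+1F600 lies above 0xFFFF and therefore takes four bytes (a quick check):

>>> '😀'.encode('utf-16le')
b'=\xd8\x00\xde'

UTF-16 also comes in two byte-order variants, UTF-16LE and UTF-16BE. The same string encoded as UTF-16BE: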

>>> 'aA巩'.encode('utf-16be')
b'\x00a\x00A\x5d\xe9'

The two results are basically the same, with the high and low bytes swapped. The suffixes LE and BE stand for little-endian and big-endian: an internal implementation detail of whether the MSB (most significant byte) is placed at the beginning or the end of the byte sequence.
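The same choice arises whenever an integer is serialized to bytes; a minimal sketch with int.to_bytes (note that Python displays the byte 0x5D as its ASCII character ]):

>>> (0x5DE9).to_bytes(2, 'little')
b'\xe9]'
>>> (0x5DE9).to_bytes(2, 'big')
b']\xe9'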

In Gulliver's Travels, the Lilliputians split into two warring camps, the Big-Endians and the Little-Endians, who fought battle after battle over which end of a boiled egg should be cracked first.

So is two bytes per character the best a Unicode encoding can do?

UTF-8: variable-length encoding

Could the number of bytes per character vary, so that pure English text uses only one byte per character while Chinese characters expand to more? That would save even more space. The answer is yes: this is the variable-length encoding UTF-8.

>>> 'aA巩'.encode('utf-8')
b'aA\xe5\xb7\xa9'

This is the shortest byte sequence so far, because a and A are each stored in a single byte.

Note that in UTF-32 and UTF-16 the bytes for 巩 spell out its code point 0x5DE9 directly, but in UTF-8 they become 0xE5 0xB7 0xA9. Clearly UTF-8 does not simply copy Unicode code points into the byte sequence; it applies a transformation. This transformation is what lets English text use single-byte storage while Chinese and other large scripts use multi-byte storage.

So how does that work?

UTF-8 conversion rules

This section is rather detailed and may be skipped.

UTF-8 is a variable-length encoding, so the byte sequence itself must carry markers that tell a decoder where each character begins and how many bytes it spans. UTF-8 follows these rules:

  • Code points between 0x00 and 0x7F, compatible with ASCII, are stored in a single byte with the template 0*** ****
  • Code points between 0x80 and 0x7FF use two bytes, with the template 110* **** 10** ****
  • Code points between 0x800 and 0xFFFF use three bytes, with the template 1110 **** 10** **** 10** ****
  • Code points between 0x10000 and 0x1FFFFF use four bytes, with the template 1111 0*** 10** **** 10** **** 10** ****

Take the Chinese character 巩 as an example. Its Unicode code point is 0x5DE9, in binary 0101 1101 1110 1001, which falls in the 0x800-0xFFFF range, so the three-byte template 1110 **** 10** **** 10** **** applies. Fill the code point's bits into the template from right to left, padding with zeros if bits run out, and the result is 1110 0101 1011 0111 1010 1001, i.e. 0xE5 0xB7 0xA9: exactly the byte sequence b'\xe5\xb7\xa9' we saw above.
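To make the templates concrete, here is a minimal by-hand encoder (the function name utf8_manual is ours, for illustration only; real code should simply call str.encode):

def utf8_manual(cp):
    # Encode a single code point by filling the templates above, right to left.
    if cp <= 0x7F:                       # 0*** ****
        return bytes([cp])
    if cp <= 0x7FF:                      # 110* **** 10** ****
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110 **** 10** **** 10** ****
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 1111 0*** 10** **** 10** **** 10** ****
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_manual(ord('巩')).hex())                     # e5b7a9
print(utf8_manual(ord('巩')) == '巩'.encode('utf-8'))   # True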

With the UTF-8 conversion details covered, let's return to the lengths of the three UTF encodings.

Lengths of the three UTF encodings

Because their compression differs, the three encodings produce files of different lengths for the same text. The following program compares the three encoded lengths for English and for Chinese text:

es = 'abcdefghij'
cs = '莫愁前路无知己，天下谁人不识君。'  # "Fear not the road ahead without a friend: who in this world does not know you?"

codes = ['utf-32le', 'utf-16le', 'utf-8']

print([len(es.encode(code)) for code in codes])
print([len(cs.encode(code)) for code in codes])

The results are [40, 20, 10] and [64, 32, 48]. For English, UTF-8 has the advantage over UTF-16 and UTF-32; for Chinese characters, UTF-16 is the most compact, because most Chinese characters occupy 2 bytes in UTF-16 but 3 bytes in UTF-8.

In everyday use, UTF-8 is the most widespread, since it offers the greatest compatibility.

At this point, having evolved from ASCII through the GB series to Unicode and its UTF encodings, we have a character encoding system that is all-inclusive, free of mojibake, and well compressed.

Ready to use, then? No! We encode the text itself, but we never record which encoding was used: when we send someone a document, they have no idea which encoding to open it with unless we tell them.
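A quick sketch of what goes wrong: decode UTF-8 bytes with some other codec, say latin-1 (which happily accepts any byte), and out comes mojibake:

>>> '巩'.encode('utf-8').decode('latin-1')
'å·©'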

We will leave the solution to that problem for the next article.

Conclusion

  • Unicode unifies the characters of the world's languages and has several encoding forms:
  • UTF-32 is simple but wasteful;
  • UTF-16 stores most characters in two bytes, saving space;
  • UTF-8 uses a variable number of bytes, one per ASCII character and more for larger scripts, balancing efficiency and space.

Gong Qingkui (Da Kui), interested in computer and electronic information engineering. gongqingkui at 126.com
