Character encoding problems are not unique to Python; they crop up in every corner of information processing. Ever since the Tower of Babel, human languages have been divided, and their written scripts even more so. Elusive encoding problems become roadblocks that leave programmers scratching their heads: they exhaust every encode and decode function and still cannot fix the error, or they occasionally succeed without knowing why.

In the story of the Tower of Babel, humanity united to build a tower reaching to heaven. To stop the plan, God made people speak different languages so that they could no longer understand one another; the plan failed and humanity scattered across the earth.

In Python 2, the awkward str type acts as both a string of characters and a sequence of bytes. Fortunately, Python 3 changed this: str represents only text, bytes represents the byte sequences produced by encoding characters, and UTF-8 is the default encoding. This greatly simplifies working with text.

However, we still frequently run into UnicodeEncodeError and UnicodeDecodeError while programming, because we do not understand encodings deeply enough. This article traces how character encodings arose and evolved, and tries to pull a single thread out of the tangled ball of encoding problems in order to defuse this ticking time bomb.

Encoding problems are rooted in the diversity of human languages, and they evolved as computer technology spread around the world. Let's start with the earliest computers.

English – ASCII

A character encoding (code table) assigns a unified number to every character and symbol used in a language, forming a table in which each character corresponds to exactly one number. The computer does not process the characters themselves; it processes the numbers.

Computers originated in the United States, and English uses only 26 letters, or 52 counting upper and lower case. That is a very small set and easy to number. Computers compute and store in binary, and a single 8-bit byte can represent 2^8 = 256 values, so a one-to-one mapping between English letters and numbers is easy to set up. This is ASCII.

ASCII does not use the highest bit of the byte; it occupies only 7 bits, numbered 0x00-0x7F, holding 128 printable and control characters. Python's ord function returns the code point of a character.

>>> print([ord(c) for c in 'abAB'])
[97, 98, 65, 66]

As you can see, the letter a corresponds to 97, b to 98, A to 65, B to 66, and so on. ASCII solves the problem of representing English letters in a computer perfectly.
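Going the other way, Python's built-in chr function maps a code point back to its character, which is how the numbers the computer actually stores become visible text again:

>>> print([chr(n) for n in [97, 98, 65, 66]])
['a', 'b', 'A', 'B']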

But as computers spread from the United States to all parts of the world, the problem became more complicated.

Other alphabetic languages – extended ASCII

What about other alphabetic languages, such as French, Russian, and Spanish, which use characters English does not? The good news is that ASCII occupies only 7 bits and 128 code points; using the full 8 bits extends it by another 128 positions, which is enough room for most alphabetic languages.

In each of these extended encodings the first 128 code points are the same as ASCII, while the upper 128 positions are defined freely by each language.
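A quick check of this (a minimal sketch; latin-1, cp1252 and cp866 are just sample single-byte code pages) shows that bytes below 128 decode to the same text under all of them:

sample = bytes([72, 105, 33])  # the ASCII bytes for 'Hi!'
pages = ['ascii', 'latin-1', 'cp1252', 'cp866']  # example single-byte encodings
print([sample.decode(p) for p in pages])  # ['Hi!', 'Hi!', 'Hi!', 'Hi!']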

The problem is that a code point above 127 refers to a different character in each language's extension. Code 130, for example, is three completely different characters in German, Turkish, and Russian. In the following program, the three values in the list codes are code pages for those three languages.

b = bytes([0x82])  # 0x82 == 130
codes = ['cp273', 'cp857', 'cp866']
print([b.decode(code) for code in codes])

The output is ['b', 'é', 'В']: above 127 the encodings diverge, and mojibake appears. A document written with the Russian code page, opened on a Turkish computer, is decoded with the Turkish code page into a meaningless string of Turkish characters.

The only fundamental cure for mojibake is a "big code" that contains the characters of every language.

Although alphabetic languages can suffer from mojibake, extended ASCII works well enough in practice: if a document is used only within one country, or if the name of its encoding is attached to the file, the correct code page can always be chosen.
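Here is a minimal sketch of both situations, using Russian text with the cp866 (Russian DOS) and cp857 (Turkish DOS) code pages as an illustrative example: decoding with the wrong code page yields gibberish, while knowing the correct encoding name restores the text.

data = 'Привет'.encode('cp866')  # Russian text stored with the Russian code page
print(data.decode('cp857', errors='replace'))  # wrong code page: decodes into meaningless characters
print(data.decode('cp866'))  # correct code page: Привет again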

Alphabetic languages can be handled within a single byte, but Chinese, Japanese, and traditional Chinese characters, which need tens of thousands of distinct characters, are far more troublesome.

Chinese characters – GB series encodings

The industrious and brave Chinese people use, in total, probably more than 100,000 Chinese characters, with tens of thousands in common use. One byte, with 256 positions, is obviously not enough.

The simplest and most direct solution is to enlarge the representation space: keep one byte for ASCII, use two bytes for Chinese characters, and if two are not enough, use three, and if three are not enough, four... This scheme keeps ASCII compatibility while encoding tens of thousands of Chinese characters in the multi-byte space.
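Here is a minimal sketch of this mixed-width idea, using GBK and the character 巩 that appears later in this article: the ASCII letters stay one byte each, while the Chinese character takes two.

s = 'AB巩'  # two ASCII letters plus one Chinese character
data = s.encode('gbk')
print(data)  # b'AB\xb9\xae'
print(len(s), len(data))  # 3 characters become 4 bytes: 1 + 1 + 2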

The commonly used encoding schemes of Chinese characters are as follows:

  • GB2312: double-byte, 1981, 6,763 Chinese characters
  • GBK ("GB extended"): double-byte, an expansion compatible with GB2312, 1995, 21,003 Chinese characters
  • GB18030: an expansion of GBK using one, two, or four bytes, 27,484 characters, compatible with GB2312 and GBK
  • BIG5: Taiwan's traditional Chinese character set, double-byte, 1984, 13,053 Chinese characters

Because they use multiple bytes, these schemes are collectively called multi-byte character sets (MBCS); as a special case, the double-byte ones are called DBCS.

Let's see how GB2312, GBK, and GB18030 encode a Chinese character. Python's encode method converts characters into a byte sequence according to the given encoding.

c = '巩'  # the character gong, U+5DE9
codes = ['gb2312', 'gbk', 'gb18030']
print([c.encode(code) for code in codes])

The output is [b'\xb9\xae', b'\xb9\xae', b'\xb9\xae']: the character 巩 is encoded as the double-byte value 0xB9AE in all three encodings, because the GB series encodings are compatible with one another.

At this point the problem of encoding Chinese characters, and of extending the character set, is solved, and the GB series encodings we commonly use handle Chinese text conveniently.

But problems remain. When we exchange documents with Taiwan, a new one appears:

>>> c.encode('big5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'big5' codec can't encode character '\u5de9' in position 0: illegal multibyte sequence

BIG5 cannot encode simplified Chinese characters. Among mainland China, Japan, South Korea, and Taiwan alone there are already many encoding schemes, and because these multi-byte encodings are incompatible with one another, exchanging documents between them still produces mojibake.
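As a sketch of that mojibake, suppose a document encoded with GBK on the mainland is opened by software that assumes BIG5; only the original encoding recovers the text.

data = '中文编码'.encode('gbk')  # text produced with GBK
print(data.decode('big5', errors='replace'))  # read as BIG5: unrelated or unmappable characters
print(data.decode('gbk'))  # decoding with the right codec recovers 中文编码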

The need for a "big code" is becoming urgent.

Conclusion

  • ASCII is simple, but it supports only English.
  • Extended ASCII supports other alphabetic languages, but the different extensions conflict with each other and produce mojibake.
  • Multi-byte encodings (MBCS) support languages with many characters, such as Chinese, but there are too many schemes and they are incompatible with one another.

Gong Qingkui (Da Kui), interested in computers and electronic information engineering. gongqingkui at 126.com
