Characters and bytes: all data stored by a computer, including text, images, audio, video, and software, is a sequence of bits (0s and 1s). Eight bits make up a byte. A character is a symbol, such as a Chinese character, a letter, or a punctuation mark. In layman's terms, bytes are the language of computers and characters are the language of people. For people and machines to communicate, you need to encode and decode. The general routine is roughly this: assign a unique ID to each member of the set to be encoded (characters, or any other symbols or resources), then store the binary form of that ID.
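As a minimal sketch of the "assign an ID, then store its binary form" routine, here is the letter A in Python 3. Python 3 is an assumption of this sketch (the transcripts later in the article are Python 2), and the variable names are only illustrative.

```python
# Python 3 sketch: a character, its numeric ID, and the byte stored for it.
ch = "A"
char_id = ord(ch)                # the unique ID assigned to the character
stored = ch.encode("ascii")      # the byte sequence that actually gets saved
bits = format(stored[0], "08b")  # the same byte written out as 8 bits

print(char_id, stored, bits)     # 65 b'A' 01000001
```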


What are ASCII, Unicode, UTF-8, and GBK?

The following is quoted from an answer by the Zhihu user "Tang Dynasty". Personally, I find it quite easy to follow.

  • By extending and adapting ASCII, China created the GB2312 encoding, which can represent more than 6,000 commonly used Chinese characters.
  • There are far too many Chinese characters, including traditional forms and assorted variants, so the GBK encoding followed. It contains everything in GB2312 and adds a great many more characters.
  • China is a multi-ethnic country, and almost every ethnic group has its own writing system. To represent those characters, GBK was further extended into GB18030.
  • Every country, like China, encoded its own language, so all kinds of encodings appeared. Without the matching encoding installed, you cannot interpret what the bytes are meant to say.
  • Finally, an organization called ISO could not stand it any longer. They created an encoding called UNICODE, which is large enough to hold any text or symbol in the world. So as long as a computer supports UNICODE, text in any script that is saved as UNICODE can be interpreted by other computers.
  • For network transmission, UNICODE has two common standards, UTF-8 and UTF-16, which work in units of 8 bits and 16 bits respectively.
  • With UTF-8 available, why do so many people in China still use GBK and similar encodings? Because UTF-8 takes more space for Chinese text (typically 3 bytes per character, versus 2 in GBK), GBK remains a workable choice when most users are Chinese.

In a word, Unicode, GBK, and ASCII are all character sets used to represent characters (symbols). ASCII is the most basic single-byte character set, and it can be stored and transmitted directly. UTF-8 is just one standard, one implementation, for storing and transmitting the Unicode character set.

Reference: "What is the difference between Unicode and UTF-8?", answer by Tang Dynasty, Zhihu
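To make the space argument in the quote concrete, the following sketch counts how many bytes one character takes in several encodings. It assumes Python 3 and its built-in codecs; encoded_size is a hypothetical helper written for this illustration only.

```python
# Python 3 sketch: bytes needed for one character in different encodings.
def encoded_size(text, encoding):
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

for ch in ("A", "\u5fc3"):  # "\u5fc3" is the character 心 (heart)
    for enc in ("gbk", "utf-8", "utf-16-le"):
        print(repr(ch), enc, encoded_size(ch, enc))
```

For 心 this reports 2 bytes in GBK and 3 bytes in UTF-8, while the ASCII letter takes 1 byte in both.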

If I were to draw a picture, it would look something like this:

Two, encode or decode: which one should be used?

Encoding is the process of converting characters into a sequence of bytes. Decoding is the reverse: converting a sequence of bytes back into characters. The two are inverse operations: encode for storage and transmission, decode for display and reading.

1. Why is encoding in Python so painful?

There are two reasons.

The first is that Python 2 uses ASCII as its default encoding, and ASCII cannot handle Chinese. So why not use UTF-8? Guido released the first version of Python in February 1991, while Unicode was released in October 1991. UTF-8 did not exist yet when Python was born.

The second is that Python 2 has two string types, str and unicode, which people often confuse.

2. str and unicode

Python 2 has two kinds of strings, str and unicode. A str is essentially a sequence of bytes: as the following example shows, the str-typed "心" (heart) is displayed as the hexadecimal bytes '\xe5\xbf\x83'.

>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> sys.stdout.encoding
'UTF-8'
>>> s = '心'
>>> type(s)
<type 'str'>
>>> s
'\xe5\xbf\x83'
>>>

The unicode-typed u'心', by contrast, is displayed as u'\u5fc3'.

>>> u = u'心'
>>> type(u)
<type 'unicode'>
>>> u
u'\u5fc3'
>>>

The experiments were run in a UTF-8 environment, so the three bytes '\xe5\xbf\x83' above are the UTF-8 encoding of "心". Here is a brief explanation of the conversion.

The Unicode code point of "心" is u'\u5fc3', that is, U+5FC3, which lies between U+0800 and U+FFFF. Code points in this range use the three-byte template 1110xxxx 10xxxxxx 10xxxxxx. We write 5FC3 in binary, 0101 1111 1100 0011, and fill its bits into the x positions in order, padding with leading zeros if the bits run short. The result is the binary data 11100101 10111111 10000011, i.e. \xe5\xbf\x83.
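The bit-filling step above can be sketched in a few lines. This is an illustration in Python 3 rather than how any codec is actually implemented; utf8_three_bytes is a hypothetical helper that handles only the U+0800 to U+FFFF range and, for brevity, does not reject surrogate code points.

```python
# Python 3 sketch: fill a code point into the 1110xxxx 10xxxxxx 10xxxxxx template.
def utf8_three_bytes(code_point):
    """Encode a code point in U+0800..U+FFFF with the 3-byte UTF-8 template."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the 3-byte range U+0800..U+FFFF is handled")
    first = 0b11100000 | (code_point >> 12)           # 1110xxxx: top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
    third = 0b10000000 | (code_point & 0x3F)          # 10xxxxxx: low 6 bits
    return bytes([first, second, third])

encoded = utf8_three_bytes(0x5FC3)
print(encoded)                               # b'\xe5\xbf\x83'
print(encoded == "\u5fc3".encode("utf-8"))   # True
```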

3. Choosing between encode and decode

To save unicode text to a file or send it over the network, it has to be converted to str, and Python provides the encode method for this. For the opposite direction, use decode.

encode

>>> u = u'心'
>>> u
u'\u5fc3'
>>> u.encode('utf-8')
'\xe5\xbf\x83'
>>> 

decode

>>> s = '心'
>>> s
'\xe5\xbf\x83'
>>> s.decode('utf-8')
u'\u5fc3'
>>> 

In a UTF-8 environment, the str s holds UTF-8 bytes by default. That is, the bytes of s are identical to the result of u.encode('utf-8'). This gives the following figure.

Once again: encode is the process of converting characters (symbols) into binary data, so unicode is converted to str with encode, and str is converted to unicode with decode.

Encoding always takes a Unicode string and returns a byte sequence, and decoding always takes a byte sequence and returns a Unicode string.
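For readers on Python 3, the same rule can be sketched as follows. Note the renaming, which is a property of Python 3 and not something the Python 2 transcripts above show: Python 3's str plays the role of Python 2's unicode, and bytes plays the role of Python 2's str.

```python
# Python 3 sketch: encode goes text -> bytes, decode goes bytes -> text.
text = "\u5fc3"                  # the character 心 as a Unicode string
data = text.encode("utf-8")      # str -> bytes
restored = data.decode("utf-8")  # bytes -> str

print(type(text).__name__, type(data).__name__)  # str bytes
print(data)                                      # b'\xe5\xbf\x83'
print(restored == text)                          # True
```

In Python 3 the direction is enforced by the types themselves: str has no decode method and bytes has no encode method, so the mix-ups analyzed in the rest of this section surface immediately as an AttributeError.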

In addition, as the relationship between characters and bytes in section 1 shows, Unicode can represent the characters of every character set, including ASCII and GBK. Therefore encode/decode must be called with the encoding that matches the data. Otherwise a UnicodeEncodeError or UnicodeDecodeError may occur.

Analyzing UnicodeEncodeError and UnicodeDecodeError with examples

UnicodeEncodeError

This error occurs when a unicode string is converted to a str byte sequence. For example,

>>> u = u'心'
>>> u.decode('gbk')

The error log

UnicodeEncodeError: 'ascii' codec can't encode character u'\u5fc3' in position 0: ordinal not in range(128)

Why does a UnicodeEncodeError occur?

Because decode converts binary data into symbols, the unicode string must first be turned into binary data. Python 2 does this by implicitly calling encode with its default encoding, ASCII. The ASCII character set contains only 128 characters and no Chinese characters, so the implicit encode fails with this error.
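The implicit step can be made visible by writing it out explicitly. The sketch below uses Python 3, where the implicit conversion no longer exists, so it only imitates what Python 2 does behind the scenes: encoding 心 with the ASCII codec. The variable names are illustrative.

```python
# Python 3 sketch: the explicit form of Python 2's hidden ASCII encode.
u = "\u5fc3"  # the character 心
caught = None
try:
    u.encode("ascii")  # the step Python 2 performs implicitly before decoding
except UnicodeEncodeError as exc:
    caught = exc       # keep the exception so it can be inspected below

print(type(caught).__name__)          # UnicodeEncodeError
print(caught.encoding, caught.start)  # ascii 0
```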

To resolve this error, call encode explicitly with a suitable encoding, such as GBK.

>>> u = u'心'
>>> u.encode('gbk')
'\xd0\xc4'
>>> u.encode('gbk').decode('gbk')
u'\u5fc3'
>>> 

Here '\xd0\xc4' is the code of the character "心" in the GBK character set (it can also be thought of as its numeric ID).

UnicodeDecodeError

This error occurs when a str byte sequence is converted to a unicode string. For example,

>>> s = '心'
>>> s.encode('gbk')

The error log

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

Why does a UnicodeDecodeError occur?

Because encode converts symbols into binary data, the str byte sequence must first be turned into symbols. Python 2 does this by implicitly calling decode with its default encoding, ASCII. The ASCII character set contains only 128 characters and no Chinese characters, so the implicit decode fails with this error. Would decoding with GBK work instead, for example by simply running s.decode('gbk')? The answer is no, because the bytes in s are UTF-8, not GBK.
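The point that only the matching codec recovers the character can be checked with a small sketch. It assumes Python 3, where the byte sequence is a bytes object; try_decode is a hypothetical helper for this illustration that returns None when decoding fails.

```python
# Python 3 sketch: decoding the same bytes with three different codecs.
utf8_bytes = b"\xe5\xbf\x83"  # the UTF-8 bytes of 心

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

print(try_decode(utf8_bytes, "ascii"))              # None: 0xe5 is outside ASCII
print(try_decode(utf8_bytes, "gbk") == "\u5fc3")    # False: wrong codec
print(try_decode(utf8_bytes, "utf-8") == "\u5fc3")  # True: matching codec
```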

To resolve this error, call decode explicitly with the correct encoding, here UTF-8, before calling encode.

>>> s = '心'
>>> s.decode('utf-8').encode('gbk')
'\xd0\xc4'
>>> 

Reference: blog.csdn.net/trochiluses…

To convert between UTF-8 and GBK, just follow the figure below. This works because of two basic facts: the characters themselves do not change, and the Unicode character set includes every character in GBK.
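The conversion path, bytes to Unicode text to bytes, can be sketched as a single function. This is a Python 3 illustration under the assumption that the input really is valid in the source encoding; transcode is a hypothetical name, not a standard library function.

```python
# Python 3 sketch: convert between encodings by going through Unicode text.
def transcode(data, source, target):
    """Re-encode bytes from `source` to `target` via a Unicode string."""
    return data.decode(source).encode(target)

utf8_bytes = b"\xe5\xbf\x83"  # 心 in UTF-8
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")

print(gbk_bytes)                                           # b'\xd0\xc4'
print(transcode(gbk_bytes, "gbk", "utf-8") == utf8_bytes)  # True
```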

To avoid UnicodeEncodeError, UnicodeDecodeError, or garbled text, encode and decode must be used in matched pairs, with the correct character set and encoding. As a best practice, follow the "sandwich" approach: decode and encode only at the entry and exit points of the program.
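A minimal sketch of the sandwich idea follows, assuming Python 3. In-memory io.BytesIO buffers stand in for a real input file and output file, and both the helper name normalize and the choice of GBK input and UTF-8 output are assumptions made for the example.

```python
# Python 3 sketch: decode once at the entry, work on text, encode once at the exit.
import io

def normalize(text):
    """Pure text processing: this layer only ever sees Unicode strings."""
    return text.upper()

incoming = io.BytesIO("heart \u5fc3".encode("gbk"))  # bytes arriving from outside
outgoing = io.BytesIO()                              # bytes leaving the program

text = incoming.read().decode("gbk")    # entry point: bytes -> text
result = normalize(text)                # middle: text only, no codecs involved
outgoing.write(result.encode("utf-8"))  # exit point: text -> bytes

print(outgoing.getvalue())  # b'HEART \xe5\xbf\x83'
```

Keeping every codec name at the two boundaries means the middle layer never needs to know which encoding the data arrived in.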

To summarize, this article gives a rough introduction to the relationships among Unicode, GBK, ASCII, and UTF-8, clarifies the relationship between characters and bytes, uses examples to show what to watch out for when calling encode/decode and to analyze the causes of UnicodeEncodeError and UnicodeDecodeError, and sums up some rules of thumb. The most important parts, however, are the three figures in the article.