Chapter 7 Python String Encoding

First of all, an encoding is a code book: it records the correspondence between binary values and characters. The common code books are:

1. ASCII: contains only English letters, digits, and special characters and their binary correspondence. It defines 128 characters, each stored in one byte, and is mainly used for modern English text.
2. GBK: contains the correspondence between Chinese characters (plus English letters, digits, and special characters) and binary. An English letter, digit, or special character takes one byte; a Chinese character takes two bytes.
3. Unicode: contains the correspondence between every character in the world and a number. The early fixed-width form used two bytes per character (UCS-2); the later fixed-width form uses four bytes per character (UCS-4, also called UTF-32).
4. UTF-8: a variable-length encoding of every Unicode character, using one to four bytes per character. English letters take 8 bits (one byte); most accented European letters take 16 bits (two bytes); most Chinese and other East Asian characters take 24 bits (three bytes).

ASCII: contains the mapping between letters, digits, and special characters and binary. Each character is represented by one byte.

A: 01000001 (one byte)

GBK: contains the mapping between Chinese characters (plus letters, digits, and special characters) and binary.

A: 01000001 (a character in the ASCII range takes one byte)

中: 11010110 11010000 (a Chinese character takes two bytes)

Unicode: contains the correspondence between every character in the world and binary. In the fixed-width four-byte form (UTF-32, shown here in big-endian order), every character takes four bytes.

A: 00000000 00000000 00000000 01000001

中: 00000000 00000000 01001110 00101101

UTF-8: contains the mapping between every character in the world and binary, using at least one byte (8 bits) per character.

A: 01000001 (a character in the ASCII range takes one byte)

é: 11000011 10101001 (accented European letters, as in Portuguese or Spanish, take two bytes)

中: 11100100 10111000 10101101 (most Chinese and other East Asian characters take three bytes)
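The byte counts described above can be checked directly in Python. The following is a minimal sketch, assuming Python 3 and its built-in codec names (`ascii`, `gbk`, `utf-32-be`, `utf-8`); the helper `to_bits` is a hypothetical name introduced here only for illustration.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as an 8-digit binary string, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)


samples = ["A", "é", "中"]
codecs = ["ascii", "gbk", "utf-32-be", "utf-8"]

for char in samples:
    for codec in codecs:
        try:
            encoded = char.encode(codec)
        except UnicodeEncodeError:
            # The character does not exist in this code book.
            print(f"{char!r} {codec:<9} cannot be encoded")
            continue
        print(f"{char!r} {codec:<9} {len(encoded)} byte(s): {to_bits(encoded)}")
```

Running it prints one line per character and code book, which makes it easy to compare how many bytes each encoding needs for the same character.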

Use the built-in functions chr and ord to view the correspondence between numbers and characters.

ord: returns the number (Unicode code point) of a character. chr: returns the character for a given number.

    print(ord('a'))  # 97
    print(chr(65))   # A
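One detail is worth checking here: the number returned by `ord` is the Unicode code point of the character, which is not the same thing as the bytes a particular encoding produces. A small sketch, assuming Python 3:

```python
char = "中"

code_point = ord(char)          # Unicode code point as an integer
print(code_point)               # 20013
print(hex(code_point))          # 0x4e2d
print(chr(code_point) == char)  # True: chr reverses ord

# The code point does not depend on any byte encoding:
print(char.encode("utf-8"))     # b'\xe4\xb8\xad'
print(char.encode("gbk"))       # b'\xd6\xd0'
```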

After this brief review of encodings, here are two points worth knowing:

1. In computer memory, text is handled uniformly as Unicode. When data needs to be saved to disk or sent over a network, it is converted to a byte encoding such as UTF-8 (what this chapter calls a "non-Unicode" encoding).

You do not need to understand this deeply; treat it as a rule. For example, when you open a file in an editor (Word, WPS, etc.), the data on disk is in a byte encoding (UTF-8 or GBK, depending on your editor settings). The editor converts it to Unicode, reads it into memory, and you edit it there. When you save, the editor converts the Unicode text back to the byte encoding (UTF-8, GBK, etc.) and writes it to the file.
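The save and load steps described here can be sketched with an ordinary text file. This is a minimal illustration, assuming Python 3; the choice of GBK and the temporary file are assumptions made only for the example.

```python
import os
import tempfile

text = "中国"  # a Unicode str in memory

# Save: Unicode str -> GBK bytes on disk.
with tempfile.NamedTemporaryFile(
    "w", encoding="gbk", suffix=".txt", delete=False
) as handle:
    handle.write(text)
    path = handle.name

# The raw bytes on disk are GBK bytes, not Unicode code points.
with open(path, "rb") as handle:
    raw = handle.read()
print(raw)  # b'\xd6\xd0\xb9\xfa'

# Load: GBK bytes on disk -> Unicode str in memory.
with open(path, "r", encoding="gbk") as handle:
    loaded = handle.read()
print(loaded == text)  # True

os.remove(path)
```

Opening the file in text mode with an `encoding` argument lets Python do the conversion in both directions, which is what an editor does behind the scenes.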

2. Different encodings cannot interpret each other's bytes directly.

Suppose you send a friend a Chinese chat message (in translation, "There is nothing wrong with old iron") encoded as UTF-8. What you actually send is a sequence of binary 0s and 1s produced by the UTF-8 encoding. When your friend receives the data and wants to read it, the binary must be converted back into Chinese characters, and that decoding must also use UTF-8. If it is decoded as GBK instead, the result is garbled text or an error.
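Both failure modes, garbled text and an outright error, can be reproduced in a few lines. This is a minimal sketch, assuming Python 3; the sample message "中国" is chosen here for illustration and is not the message quoted above.

```python
message = "中国"

sent = message.encode("utf-8")          # the bytes that travel over the network
print(sent.decode("utf-8") == message)  # True: same code book on both ends

# Wrong code book, case 1: UTF-8 bytes read as GBK.
# errors="replace" substitutes U+FFFD for anything GBK cannot map.
garbled = sent.decode("gbk", errors="replace")
print(garbled == message)               # False: the text is garbled

# Wrong code book, case 2: GBK bytes read as UTF-8 fail outright.
try:
    message.encode("gbk").decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)                           # True
```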

With that out of the way, let's move on to the most important part of encoding.

Prerequisite: Python 3.x (Python 2.x behaves differently).
Main uses: data storage and transmission.

As mentioned earlier, text in memory is Unicode, and when data needs to be saved to disk or sent over a network, it is converted to a byte encoding such as UTF-8.

Take network transmission as an example:

The first thing to understand is the "data" mentioned here. This data is in fact a special kind of string. Python has many data types, such as int, bool, list, dict, and str. If I want to send a list to Xiao Ming over the network, can I send it as-is? No. The list must first be converted to this special string type before it can be transmitted, and the same goes for data storage.
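One common way to turn a list into something transmittable is to serialize it to text and then encode that text. The sketch below assumes JSON as the text format, which the chapter does not prescribe; the list contents are hypothetical.

```python
import json

data = ["Xiao Ming", 10, True]   # hypothetical list to transmit

# Sender: list -> str (JSON text) -> bytes
payload = json.dumps(data).encode("utf-8")
print(type(payload))             # <class 'bytes'>

# Receiver: bytes -> str -> list
received = json.loads(payload.decode("utf-8"))
print(received == data)          # True
```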

So the data you store or send over a network must be of this special string type. Then can I just send an ordinary string? For example, I have the string 'Chicken at 10 tonight, good luck'. Can I send it to Xiao Ming over the network directly? No. The detail that is easy to miss is the word "special": an ordinary str is not enough.

So what's the solution? Convert the str to the special string type, which is bytes.

What is the bytes type? It is one of Python's basic data types.
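The difference between str and bytes shows up as soon as data meets a socket. The following is a minimal sketch, assuming Python 3; `socket.socketpair()` creates two connected local sockets and is used here only as a stand-in for a real network link.

```python
import socket

text = "中国"

# A connected pair of local sockets stands in for a real network link.
sender, receiver = socket.socketpair()
try:
    try:
        sender.sendall(text)               # a str is rejected
        rejected = False
    except TypeError:
        rejected = True
    print(rejected)                        # True

    sender.sendall(text.encode("utf-8"))   # bytes are accepted
    received = receiver.recv(1024)
    print(received.decode("utf-8") == text)  # True
finally:
    sender.close()
    receiver.close()
```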

The bytes type is very similar to the string type. If you look at the source code for bytes, you will see that it supports most of the same operations as str.

Bytes, also called a byte string, is the type used for sending data over a network and for storage. A bytes literal looks like a string literal with a b in front of the opening quote. Why does Python need both types? Can't I just use bytes?

Developing with bytes alone would be inconvenient, because non-ASCII text in a bytes object is displayed only as hexadecimal escapes, which are hard to read.

    s1 = '中国'                          # str: "China", readable
    b1 = b'\xe4\xb8\xad\xe5\x9b\xbd'    # bytes: the same text, UTF-8 encoded
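A short comparison makes the similarities and the readability difference concrete. This is a sketch assuming Python 3; the sample values are chosen only for illustration.

```python
ascii_bytes = b"hello"
print(ascii_bytes)                  # b'hello'
print(ascii_bytes.upper())          # b'HELLO': many str methods also exist on bytes
print(ascii_bytes[0])               # 104: indexing bytes gives an int

text = "中国"
utf8_bytes = text.encode("utf-8")
print(text)                         # 中国
print(utf8_bytes)                   # b'\xe4\xb8\xad\xe5\x9b\xbd'
print(len(text), len(utf8_bytes))   # 2 6
```

Note that `len` counts characters for a str but bytes for a bytes object, so the same text reports different lengths.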

Now that we have a general understanding of the bytes type and how it compares with str, we can solve the final problem: converting between str and bytes.

As explained above, str data cannot be written to a file or sent out as-is. Convert the str to bytes first.

str ---> bytes

    s1 = '中国'
    b1 = s1.encode('utf-8')
    print(s1)  # 中国
    print(b1)  # b'\xe4\xb8\xad\xe5\x9b\xbd'

    s1 = '中国'
    b1 = s1.encode('gbk')  # b'\xd6\xd0\xb9\xfa'

bytes ---> str

    b1 = b'\xe4\xb8\xad\xe5\x9b\xbd'
    s1 = b1.decode('utf-8')
    print(s1)  # 中国

One more problem is especially important, and you will meet it often in future work: converting GBK-encoded data into UTF-8-encoded data. If that sounds confusing, remember that a bytes object always holds text in some specific byte encoding, which may be GBK, UTF-8, GB2312, and so on.

    b1 = b'\xe4\xb8\xad\xe5\x9b\xbd'  # "中国" (China) as UTF-8-encoded bytes
    b2 = b'\xd6\xd0\xb9\xfa'          # "中国" (China) as GBK-encoded bytes

How do you convert GBK-encoded bytes to UTF-8-encoded bytes?

As stated above, different encodings cannot interpret each other's bytes directly, so the conversion has to be indirect. What do all the encodings in the world have in common? Each one maps to Unicode, so Unicode is the intermediate step for the conversion.

The route looks like this: GBK bytes ---> (decode) ---> Unicode str ---> (encode) ---> UTF-8 bytes.
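The indirect conversion through Unicode can be written in two calls. This is a minimal sketch, assuming Python 3 and the GBK bytes for "中国" used earlier in the chapter; the variable names are illustrative.

```python
gbk_bytes = b"\xd6\xd0\xb9\xfa"      # GBK-encoded bytes

text = gbk_bytes.decode("gbk")       # GBK bytes -> Unicode str
utf8_bytes = text.encode("utf-8")    # Unicode str -> UTF-8 bytes

print(text)        # 中国
print(utf8_bytes)  # b'\xe4\xb8\xad\xe5\x9b\xbd'
```

The same two steps work in the other direction: decode with `utf-8`, then encode with `gbk`.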
