Author: Milter, a machine learning enthusiast, NLP practitioner and lifelong learner.

Blog: yuque.com/liwenju

One, what do we mean when we say characters?

When we mention strings, every programmer understands that we’re talking about a sequence of characters. However, when we talk about characters, many people get confused.

Characters written on paper are easy to recognize, but to identify different characters in computers, people created the Unicode standard. Simply put, Unicode can be thought of as a standard function that maps each character to a number between 0 and 1114111 (0x10FFFF), called a code point. Code points are typically written in hexadecimal and prefixed with “U+”. For example, the code point of the letter A is U+0041.

It makes sense, then, that code points are a natural way to represent characters inside a computer. In fact, str objects in Python 3 and unicode objects in Python 2 represent characters in memory as code points.

However, because there are so many characters in the world, the numbers that represent code points need a data type such as int or long, with each character taking up a fixed number of bytes. That wastes space when text is stored on disk or sent over a network. So people devised functions that map a code point to a sequence of bytes, largely in order to reduce the space used. Such a function is called an encoding. In other words, an encoding is the algorithm used to convert between code points and byte sequences.

For example, the uppercase letter A (U+0041) encoded in UTF-8 is \x41, where \x41 denotes a single byte with the hexadecimal value 41. If we stored the code point U+0041 as a 4-byte integer, the capital letter A would need 4 bytes; after encoding it takes up only one byte, so space is saved.

As a reminder, when we talk about code points, we can think of it simply as an int.
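The idea of a code point as an int, and of an encoding as a mapping from code points to bytes, can be sketched with Python 3 built-ins. The euro sign and the utf_32_le codec are only illustrative choices, not something the text above prescribes.

```python
# A code point is just an integer; ord() and chr() convert both ways.
code_point = ord("A")
print(code_point)        # 65
print(hex(code_point))   # 0x41
print(chr(0x41))         # A

# Encoding maps code points to bytes. In UTF-8 the letter A is one byte.
encoded_a = "A".encode("utf_8")
print(encoded_a, len(encoded_a))   # b'A' 1

# A fixed-width encoding spends 4 bytes on every character,
# while UTF-8 uses a variable number of bytes (3 for the euro sign).
euro = "\u20ac"
utf8_len = len(euro.encode("utf_8"))
utf32_len = len(euro.encode("utf_32_le"))
print(utf8_len, utf32_len)         # 3 4
```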

Two, how code points and encoded bytes are represented in Python 3

In Python 3 code, an object of type str is a string represented as code points, and an encoded sequence of bytes is represented as a bytes object.
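A minimal sketch of the two types, using the word "café" as an assumed sample string:

```python
text = "caf\u00e9"                  # 'café' as a str: 4 code points
print(type(text).__name__, len(text))      # str 4

octets = text.encode("utf_8")       # the same text as encoded bytes
print(type(octets).__name__, octets, len(octets))   # bytes b'caf\xc3\xa9' 5

# Decoding with the same codec recovers the original str.
round_trip = octets.decode("utf_8")
print(round_trip == text)           # True
```

The str has four characters, but the UTF-8 bytes object has five elements because é needs two bytes.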

A bytes object can be thought of as an array that supports slicing and other sequence operations, and whose elements are integers from 0 to 255 inclusive. Python 2.6 added a similar type, bytearray. It resembles bytes, with the following two differences:

1, bytearray has no literal syntax. A bytes object can be written directly as a literal with the b prefix. There is no equivalent literal for bytearray; it can only be obtained by calling the bytearray() constructor.

2, bytes is immutable, whereas bytearray is mutable.
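Both differences can be checked in a few lines. The variable names here are illustrative only.

```python
data = b"caf"             # bytes has a literal syntax
buf = bytearray(data)     # bytearray must be built with its constructor

print(data[0])            # 99: indexing returns an int
print(data[:1])           # b'c': slicing returns a bytes object

buf[0] = ord("C")         # a bytearray can be modified in place
print(buf)                # bytearray(b'Caf')

try:
    data[0] = ord("C")    # a bytes object cannot
    bytes_mutated = True
except TypeError as exc:
    bytes_mutated = False
    print("bytes is immutable:", exc)
```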

When we print a bytes object, we often see something like b'caf\xc3\t', which looks a bit messy. Let’s break it down:

  • b indicates that the printed object is a bytes object.
  • caf stands for three bytes whose values are the ASCII codes of the characters c, a, and f, so they are displayed directly as those three characters.
  • \xc3 means that this byte has the hexadecimal value C3, which is outside the printable ASCII range, so it is shown as a two-digit hexadecimal escape.
  • \t means that the value of this byte is the tab character, which is shown here as an escape sequence.
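To confirm the breakdown above, the same five bytes can be built from their integer values and printed:

```python
# 0x63, 0x61, 0x66 are the ASCII codes of c, a, f; 0xC3 is not ASCII; 0x09 is tab.
raw = bytes([0x63, 0x61, 0x66, 0xC3, 0x09])

print(raw)         # b'caf\xc3\t'
print(list(raw))   # [99, 97, 102, 195, 9]
```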

Three, codecs in Python

Python ships with over 100 codecs! The first time I heard this, I was shocked. Humans really do like to tinker. Different codecs can turn the same characters into quite different byte sequences.
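As a small illustration, here is one string encoded with three codecs. The sample text "El Niño" and the choice of codecs are assumptions made for this sketch; utf_16_le is used instead of utf_16 so that no byte order mark is added.

```python
sample = "El Ni\u00f1o"   # 'El Niño'

encoded = {}
for codec in ("latin_1", "utf_8", "utf_16_le"):
    encoded[codec] = sample.encode(codec)
    print(codec, encoded[codec], len(encoded[codec]), sep="\t")
```

latin_1 uses one byte per character, utf_8 needs two bytes for ñ, and utf_16_le uses two bytes for every character in this string.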

These codecs are commonly passed to functions such as open(), str.encode(), and bytes.decode(). The most common codec is definitely UTF-8. In Python it can be referred to as utf_8, utf8, or U8, among other aliases (codec names are not case-sensitive). It’s best to familiarize yourself with these aliases.
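One way to check that several spellings refer to the same codec is codecs.lookup() from the standard library, which returns a CodecInfo object whose name attribute is the canonical name. The list of spellings below is just a sample.

```python
import codecs

aliases = ("utf_8", "utf8", "U8", "UTF-8")
canonical = [codecs.lookup(alias).name for alias in aliases]

for alias, name in zip(aliases, canonical):
    print(alias, "->", name)   # every alias resolves to utf-8
```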

Four, dealing with common codec errors

When encoding and decoding in Python, various errors often occur. Many people’s solution is to search Google, try whatever they find, and leave it at that. I used to do that myself. A more systematic approach is to understand the common types of errors and how to resolve each one when it occurs. Here’s a look at three common errors.

  • UnicodeEncodeError

When you encode a Unicode string with a codec, a UnicodeEncodeError is raised if that codec cannot represent one of the characters being encoded. In such cases, you usually need to switch to a different codec. Simply put, this is an error that occurs while encoding Unicode text into bytes.
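A minimal sketch of the error and of some ways around it. Besides switching codecs, str.encode() accepts an errors argument; the sample word and the ascii codec are assumptions chosen to trigger the error.

```python
text = "caf\u00e9"   # 'café'; é does not exist in ASCII

try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print("cannot encode:", exc)

# Option 1: use a codec that can represent the character.
as_utf8 = text.encode("utf_8")                                 # b'caf\xc3\xa9'

# Option 2: keep the codec but choose how unencodable characters are handled.
ignored = text.encode("ascii", errors="ignore")                # b'caf'  (data is lost)
replaced = text.encode("ascii", errors="replace")              # b'caf?'
xml_refs = text.encode("ascii", errors="xmlcharrefreplace")    # b'caf&#233;'
print(as_utf8, ignored, replaced, xml_refs)
```

The ignore and replace handlers silently lose information, so changing the codec is usually the safer fix.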

  • UnicodeDecodeError

When you decode a byte sequence into Unicode with a given codec, a UnicodeDecodeError is raised if the byte sequence is not valid for that codec. There are two possible causes: either the byte sequence itself is corrupt, or the wrong codec is being used.
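The second cause, using the wrong codec, can be reproduced as follows. The bytes below are assumed to be "café" encoded as Latin-1, which is not valid UTF-8.

```python
octets = b"caf\xe9"   # 'café' in latin_1; \xe9 alone is invalid in UTF-8

try:
    octets.decode("utf_8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("cannot decode:", exc)

# Using the codec the bytes were actually written with works.
as_latin1 = octets.decode("latin_1")
print(as_latin1)      # café

# errors="replace" substitutes U+FFFD for each undecodable byte.
lossy = octets.decode("utf_8", errors="replace")
print(lossy)
```

Note that some codecs, latin_1 among them, can decode any byte sequence without raising an error, so the absence of an exception does not prove that the right codec was chosen.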

  • SyntaxError

Python 3 assumes source files are encoded in UTF-8 by default, while Python 2 assumes ASCII. A SyntaxError is raised if a loaded .py file contains bytes that are not valid UTF-8 and no encoding is declared. The best practice is to state the codec explicitly by adding an encoding declaration (a coding comment) at the top of the file.
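The effect of the declaration can be sketched with tokenize.detect_encoding() from the standard library, which applies the same coding-comment rule to raw source bytes. The helper source_encoding() and the sample source line are hypothetical, written only for this illustration.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the source encoding that tokenize detects for the given bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# One line of source saved as Latin-1, so it contains a byte that is not valid UTF-8.
body = "name = 'caf\u00e9'\n".encode("latin_1")

try:
    source_encoding(body)
    undeclared_ok = True
except SyntaxError as exc:
    undeclared_ok = False
    print("without a declaration:", exc)

# The same bytes with a coding comment on the first line.
declared = b"# -*- coding: latin-1 -*-\n" + body
declared_encoding = source_encoding(declared)
print("with a declaration:", declared_encoding)   # iso-8859-1
```

Saving source files as UTF-8 in the first place avoids the need for the declaration in Python 3.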

Five, the differences between several encoding defaults

  • locale.getpreferredencoding()

This is the default encoding used when opening text files. Check this value if you call open() without specifying an encoding and an error occurs. The result depends on the platform and locale settings, so take a look at the value on your own computer.
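A short sketch for checking the value, followed by one way to avoid depending on it: passing encoding explicitly to open(). The temporary file and the sample text are assumptions for the example, and no expected value is shown for the first print because it varies between machines.

```python
import locale
import os
import tempfile

preferred = locale.getpreferredencoding()
print(preferred)   # varies by platform and locale settings

text = "caf\u00e9"
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    # An explicit encoding makes the result independent of the locale default.
    with open(path, "w", encoding="utf_8") as handle:
        handle.write(text)
    with open(path, encoding="utf_8") as handle:
        read_back = handle.read()

print(read_back == text)   # True
```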

  • sys.getdefaultencoding()

This is the encoding used by default when converting between byte sequences and strings in a Python program, for example when str.encode() or bytes.decode() is called without arguments. In Python 3 it is UTF-8.
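In Python 3 this can be verified directly:

```python
import sys

default_encoding = sys.getdefaultencoding()
print(default_encoding)          # utf-8

# encode() and decode() without arguments use UTF-8.
octets = "caf\u00e9".encode()
print(octets)                    # b'caf\xc3\xa9'
print(octets.decode() == "caf\u00e9")   # True
```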

  • sys.getfilesystemencoding()

This is the default codec for file names. Note that it applies to the file name, not the file content. open() receives the file name as a Unicode string; Python encodes that name into a sequence of bytes and uses it to look the file up in the file system. Compare this value with the result of locale.getpreferredencoding() above.
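The value can be inspected as below. os.fsencode() and os.fsdecode() are the standard-library functions that convert file names using this encoding; the file name notes.txt is only a placeholder.

```python
import os
import sys

fs_encoding = sys.getfilesystemencoding()
print(fs_encoding)   # varies by platform

# Convert a file name to bytes and back using the file system encoding.
name_bytes = os.fsencode("notes.txt")
name_text = os.fsdecode(name_bytes)
print(name_bytes, name_text)
```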

  • sys.stdout.encoding and sys.stdin.encoding

These are the default encodings of Python’s standard output and standard input; their values depend on the environment the program runs in.

When Chinese output appears garbled, check both the encoding Python uses for its output and the encoding the console uses to decode and display it. In theory, as long as the two match, no garbling should occur.
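The sketch below prints the stream encodings and then imitates a mismatch between writer and reader. The helper stream_encodings() is hypothetical, and the UTF-8 and Latin-1 pair is an assumed stand-in for a program and a console that disagree.

```python
import sys


def stream_encodings():
    """Collect the encodings of the standard streams (None if a stream has none)."""
    return {
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


print(stream_encodings())   # varies by environment

# The program writes UTF-8 bytes, but the "console" decodes them as Latin-1.
written = "caf\u00e9".encode("utf_8")
garbled = written.decode("latin_1")
print(garbled)              # cafÃ©

# When both sides agree, the text survives intact.
intact = written.decode("utf_8")
print(intact)               # café
```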

Six, no summary, no progress

If you truly master the encoding and decoding knowledge covered above, it is enough for everyday work. To truly master it, actively apply it to real problems and use it to track down bugs; that is the quickest way to deepen your understanding.
