This is the last article in the encoding series, and it covers only Python 2.

Compared with Python 3, Python 2 has a number of puzzling encoding problems, and these are what Python 2 is most often criticized for.

The ordinary encoding problems are the ones already described for Python 3 in the earlier articles:

  • A file or web page is read with a different encoding from the one it was written in. This is perfectly normal, and the solution is clear: use the correct encoding.
  • The encode and decode methods are called with the wrong encoding and raise an error. This too is solved by finding the correct encoding.
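A minimal sketch of this ordinary case, written for Python 3. The sample text and the choice of GBK and UTF-8 are assumptions made purely for illustration.

```python
# Text written with one encoding and read back with another.
text = '中文'
raw = text.encode('gbk')           # what ends up in the file or on the wire

try:
    raw.decode('utf-8')            # reading with the wrong encoding
    wrong_failed = False
except UnicodeDecodeError:
    wrong_failed = True            # these GBK bytes are not valid UTF-8

restored = raw.decode('gbk')       # the correct encoding recovers the text
print(wrong_failed, restored == text)
```

Nothing here happens implicitly: the failure and the fix are both visible in the code, which is what separates this kind of problem from the Python 2 ones discussed next.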

Many encoding problems in Python 2 are confusing because the encoding is not done by you: the program often encodes or decodes characters implicitly, and incorrectly, before you are aware of it, producing errors or garbled text. Users are then left wondering how an encoding error can occur when they never encoded or decoded anything. Here is an overview of why these problems occur; with the basics from the earlier articles in mind, the problems in Python 2 are easy to understand.

These puzzling encoding problems are mainly caused by two default encodings:

  • The default encoding in Python 2 is ASCII, whereas in Python 3 it is UTF-8 (when this default comes into play is covered later). This is the main cause of encoding problems.
  • In Python 2, a string defined with a = '中文' is not really a string, because it is not Unicode. By the time a is defined, an encoding conversion has already been performed; this one has nothing to do with ASCII and uses a different default encoding.

These two default encodings almost never come into play in Python 3; that they matter so much in Python 2 is a design flaw. Here are five examples of how Python 2 causes encoding errors.

1. Automatic encoding when defining a str

In Python 2, a = '中文' is not Unicode, while b = u'中文' is. In Python 3 the u prefix is unnecessary (it is accepted again since 3.3 but has no effect): a = '中文' already makes a Unicode.

To be exact, a is a byte string and b is a string (as mentioned earlier, Unicode and string mean exactly the same thing). a is actually a binary byte stream, the product of encoding a string. So the assignment a = '中文' implicitly encodes the string '中文' into bytes.

Since the string is encoded, some encoding must have been used, and which one differs between environments. The current default can be checked with the following code:

import locale
locale.getdefaultlocale()

This is the second of the two default encodings.

On the Windows command line the result is typically ('zh_CN', 'cp936'), even after chcp 65001 has been set; Jupyter reports cp936 as well.

Note: when a string (a byte string, to be exact) is defined, it is encoded in GBK by default on Windows.

Having two ways to define strings confuses users and makes it easier to step on the pitfalls described below.

If you do not know how a byte string was originally encoded, you will decode it with the wrong encoding and get an error.
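To make this concrete, here is a small Python 3 sketch of what the byte string would contain under two encodings. Python 2 performs the encoding implicitly; spelling it out with encode is an illustrative stand-in, not a description of Python 2's internals.

```python
text = '中文'

gbk_bytes = text.encode('gbk')       # what a cp936 (GBK) environment would store
utf8_bytes = text.encode('utf-8')    # what a UTF-8 environment would store

print(gbk_bytes)     # b'\xd6\xd0\xce\xc4'
print(utf8_bytes)    # b'\xe4\xb8\xad\xe6\x96\x87'
print(gbk_bytes == utf8_bytes)   # False
```

The same two characters become four bytes in one environment and six in the other, which is why the original encoding has to be known before decoding.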

2. Errors caused by mixing str and unicode

Because of the difference described in section 1, a defined as a = '中文' also behaves differently from Python 3 when it comes to encoding and decoding.

  • In Python 3, an a defined this way can only be encoded, because it is Unicode.
  • In Python 2, this a is mainly decoded, because it is already encoded.

This is where the first default encoding comes into play.

If str is mixed with unicode, as in '中文' + u'好', Python 2 implicitly decodes the str to unicode using its default encoding, ASCII. Chinese characters have no ASCII representation, so an error is raised. This is the first case where the first default encoding produces an error, and there are many other implicit conversions like it.
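The implicit step can be imitated in Python 3 by writing it out by hand. This is a sketch under the assumption that the byte string holds UTF-8; the variable names are illustrative.

```python
utf8_bytes = '中文'.encode('utf-8')    # stands in for a Python 2 str

# What Python 2 attempts implicitly for str + unicode:
try:
    utf8_bytes.decode('ascii') + '好'
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# The fix: decode explicitly with the encoding the bytes really use.
combined = utf8_bytes.decode('utf-8') + '好'

print(message)
print(len(combined))   # 3
```

The error message is the same "'ascii' codec can't decode" complaint that Python 2 raises, only here the ASCII decode is visible in the code.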

So first of all, here is how to check what the default encoding is:

import sys
sys.getdefaultencoding()

This command returns ascii in Python 2 and utf-8 in Python 3. Both default encodings have now appeared, and the commands for inspecting them are different.
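For reference, both defaults can be inspected side by side in Python 3. This sketch swaps in locale.getpreferredencoding for the getdefaultlocale call used earlier, since the latter is deprecated in recent Python 3 releases; the variable names are illustrative.

```python
import locale
import sys

default_encoding = sys.getdefaultencoding()            # used for implicit conversions
locale_encoding = locale.getpreferredencoding(False)   # depends on the OS and locale

print(default_encoding)   # 'utf-8' on Python 3
print(locale_encoding)    # varies by machine, e.g. 'cp936' or 'UTF-8'
```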

A little more on how Python 3 is designed at this point. If str and bytes are added directly, a TypeError is raised, for example "Can't convert 'bytes' object to str implicitly" (the exact wording varies between Python 3 versions). Python 3 explicitly refuses to do what Python 2 does.
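A short Python 3 sketch of that refusal, followed by the explicit conversion that makes the concatenation work. The sample strings are illustrative.

```python
suffix_bytes = '好'.encode('utf-8')

try:
    '中文' + suffix_bytes                 # str + bytes: no implicit decode
    refused = False
except TypeError:
    refused = True

joined = '中文' + suffix_bytes.decode('utf-8')   # the caller chooses the encoding

print(refused)       # True
print(len(joined))   # 3
```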

3. Errors caused by nonstandard encoding and decoding in Python 2

There are two differences between Python 2 and Python 3:

  • The two types in Python 3, str and bytes, correspond to unicode and str in Python 2, respectively. In Python 2, a = '中文' makes a a str, and b = u'中文' makes b a unicode.
  • In Python 2, both unicode and str have encode and decode methods, whereas in Python 3 str only has encode and bytes only has decode.
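The second point is easy to verify in Python 3 with hasattr; the variable names below are illustrative.

```python
text = '中文'                   # Python 3 str (Unicode)
data = text.encode('utf-8')     # Python 3 bytes

str_methods = (hasattr(text, 'encode'), hasattr(text, 'decode'))
bytes_methods = (hasattr(data, 'encode'), hasattr(data, 'decode'))

print(str_methods)     # (True, False)
print(bytes_methods)   # (False, True)
```

Because the wrong method simply does not exist, the mistake surfaces as an AttributeError at the call site instead of an encoding error from a hidden conversion.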

In Python 2, str should only ever be decoded and unicode should only ever be encoded, but because both methods are available on both types, users are more easily confused about which one to call. Calling the wrong one produces errors like these:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

The first error comes from calling decode on a unicode object, the second from calling encode on a str. Internally, these forced calls work as follows:

  • When unicode is forced to decode, it cannot be decoded directly, so Python 2 first encodes it with ASCII and then decodes the result. The error occurs during the encode step.
  • When str is forced to encode, it cannot be encoded directly, so Python 2 first decodes it with ASCII and then encodes the result. The error occurs during the decode step.
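Both two-step mechanisms can be written out explicitly in Python 3 to see which step fails. This is a sketch of the behavior described above, assuming UTF-8 as the encoding the user asked for.

```python
text = '中文'
data = text.encode('utf-8')

# Python 2 unicode.decode('utf-8') acts like: encode with ASCII, then decode.
try:
    text.encode('ascii').decode('utf-8')
except UnicodeError as exc:
    decode_on_unicode = type(exc).__name__

# Python 2 str.encode('utf-8') acts like: decode with ASCII, then encode.
try:
    data.decode('ascii').encode('utf-8')
except UnicodeError as exc:
    encode_on_str = type(exc).__name__

print(decode_on_unicode)   # UnicodeEncodeError
print(encode_on_str)       # UnicodeDecodeError
```

This explains the seemingly backwards error names: a decode call fails with an encode error, and an encode call fails with a decode error.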

This is the second case where the first default encoding produces an error.

4. The default encoding for print

As mentioned earlier, a = '中文' is a byte string, not a string, and entering a in the console produces different output from print(a):

  • Entering a shows the bytes in hexadecimal form, which is what a really contains.
  • Entering print(a) shows '中文', the characters we can read.
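A rough Python 3 analogue of the two outputs, assuming the bytes are UTF-8. In Python 3, print on a bytes object also shows the escaped form, so the readable form is obtained here with an explicit decode.

```python
data = '中文'.encode('utf-8')     # stands in for the Python 2 str a

echoed = repr(data)               # what the console echo shows: escaped bytes
shown = data.decode('utf-8')      # what a UTF-8 terminal would render

print(echoed)            # b'\xe4\xb8\xad\xe6\x96\x87'
print(shown == '中文')    # True
```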

With print, the bytes are decoded for display by default. What print works with here is a str, so if a unicode object is passed in, it is automatically encoded to str first and then printed, as shown below:

b = u'中文'
print(b)  # UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

This is the third case where the first default encoding produces an error.
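The same failure can be reproduced in Python 3 by printing to a stream that can only encode ASCII. The in-memory streams below are an illustrative setup, not what Python 2 uses internally.

```python
import io

# A text stream limited to ASCII, like the fallback described above.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding='ascii')

try:
    print('中文', file=ascii_stream)
    print_failed = False
except UnicodeEncodeError:
    print_failed = True

# A stream with a suitable encoding accepts the same text.
sink = io.BytesIO()
utf8_stream = io.TextIOWrapper(sink, encoding='utf-8', newline='\n')
print('中文', file=utf8_stream)
utf8_stream.flush()
written = sink.getvalue()

print(print_failed)   # True
print(written)        # b'\xe4\xb8\xad\xe6\x96\x87\n'
```

Whether print succeeds depends on the encoding of the output stream, not on the text itself.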

5. Passing the wrong type as a function argument

When a function expects a str argument but receives a unicode object, the argument is also automatically encoded with ASCII.

This generalizes the print case in section 4; str and raw_input behave the same way:

b = u'中文'
print(b)
str(b)
a = raw_input(b)

Each of the last three lines, run on its own, raises an error, again because the implicit encode uses ASCII. The first default encoding thus causes still more errors.
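One way to sidestep the implicit conversion is to encode explicitly before the call. The sketch below is written for Python 3; to_bytes and byte_length are hypothetical helpers introduced only for illustration, and UTF-8 as the default is an assumption.

```python
def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding text explicitly (hypothetical helper)."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def byte_length(data):
    """Stand-in for an API that expects a byte string."""
    return len(data)


print(byte_length(to_bytes('中文')))           # 6
print(byte_length(to_bytes('中文', 'gbk')))    # 4
print(to_bytes(b'abc'))                        # b'abc'
```

The caller picks the encoding at the boundary, so no default ever has to be guessed.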

For more on encoding problems in Python 2, see these three very good articles:

  • The nature of Python coding errors
  • The ultimate Guide to Python 2.x character encoding
  • Familiar and unfamiliar character encodings

Since Python 2 is not installed on my computer, all tests were run on this site.

Column information

Column home: Programming in Python