The cause

UnicodeEncodeError and UnicodeDecodeError in Python 2 are notoriously tricky and often confusing. The author of Fluent Python even describes what he calls a “sandwich model” to help with such problems (it doesn’t have to be that hard, as I’ll explain later).

Today I ran into a small problem of this kind in production. It felt interesting enough to write a short post about.

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Going through the log, I quickly found the corresponding line of code, something like this:

thrift_obj = ThriftKeyValue(key=str(xx_obj.name))  # the failing line; xx_obj.name is a string

str(xx_obj.name): I don’t know whether this was a slip or done on purpose, but either way it is not a pattern I would copy.

Analysis

A string is being encoded with the ASCII codec, but it contains characters outside the range ASCII can represent, so the encode fails. That leads to two guesses:

  • Somewhere there must be a Unicode string containing characters outside the ASCII range (which characters does not matter). The obvious suspect is xx_obj.name.
  • Somewhere an encoding operation is taking place, and it happens implicitly. These implicit conversions are the most annoying part, and Python 2 has many of them, so the code does not show where it happens.

After looking around, the only candidate is the built-in function str(), so I simply tried the following:

In [5]: u = u'中国'

In [6]: str(u)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-6-b3b94fb7b5a0> in <module>()
----> 1 str(u)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
In [7]: b = u.encode('utf-8')

In [8]: str(b)
Out[8]: '\xe4\xb8\xad\xe5\x9b\xbd'


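The implicit step can also be written out explicitly. In Python 2, calling str() on a unicode object encodes it with the default encoding, which is ASCII unless it has been changed, so the failure above is the same one an explicit encode to ASCII produces. Here is a minimal sketch, written so that it also runs on Python 3 (the variable names are mine):

```python
# -*- coding: utf-8 -*-
text = u'中国'  # a unicode string with two non-ASCII characters

# Explicit version of what Python 2's str(text) attempts implicitly.
error = None
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc

# The exception records the codec and the offending slice of the input.
print('%s codec failed on characters %d-%d'
      % (error.encoding, error.start, error.end - 1))

# Choosing a codec that covers the characters succeeds.
utf8_bytes = text.encode('utf-8')
print(repr(utf8_bytes))
```

Spelling out the codec name makes the failure point visible in the code, which is exactly what the implicit conversion inside str() hides.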

Sure enough. I then checked the documentation, but it had nothing useful to say; the description is too vague:

class str(object='')
Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. The difference with repr(object) is that str(object) does not always attempt to return a string that is acceptable to eval(); its goal is to return a printable string. If no argument is given, returns the empty string, ''.

For more information on strings see Sequence Types - str, unicode, list, tuple, bytearray, buffer, xrange which describes sequence functionality (strings are sequences), and also the string-specific methods described in the String Methods section. To output formatted strings use template strings or the % operator described in the String Formatting Operations section. In addition see the String Services section. See also unicode().

In our code (Python 2), each py file has a line like this:

from __future__ import unicode_literals, absolute_import

So I guessed that xx_obj.name was a unicode string. I logged its type, and sure enough it was.

The fix

At this point there are two options. The first is to convert xx_obj.name into something str() can handle, which in this case means bytes rather than unicode. I didn’t do that because it is ugly. The second, which is what I did, is to change the line to this:

thrift_obj = ThriftKeyValue(key=xx_obj.name)  # no need to call str(); it only worked before because name happened to be ASCII

The bug is fixed, and everything else works as before.

Conclusion

As mentioned earlier, Python 2 performs many of these implicit conversions, and they are poorly documented. Combined with a Windows environment and print operations, the error messages become especially confusing. Fluent Python has an instructive discussion of a “sandwich model” for dealing with this problem.

However, the general rule I follow is: Unicode only, Unicode everywhere. The method is as follows:

  • All py files must have the following header:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#

from __future__ import unicode_literals, absolute_import
  • Bytes received from the outside world (from the network, from a file, etc.) are decoded to Unicode first. It is best to put this in helper functions so the decoding code is not repeated everywhere:
class DecodeBytesFailedException(Exception):
    """Raised when none of the candidate encodings can decode the input.

    Defined here so that the snippet is self-contained.
    """


class UnicodeUtils(object):

    @classmethod
    def get_unicode_str(cls, bytes_str, try_decoders=('utf-8', 'gbk', 'utf-16')):
        """Convert to a string (usually Unicode)"""

        if not bytes_str:
            return u''

        # Already unicode: nothing to decode.
        if isinstance(bytes_str, (unicode,)):
            return bytes_str

        # Try each encoding in order and return the first that succeeds.
        for decoder in try_decoders:
            try:
                unicode_str = bytes_str.decode(decoder)
            except UnicodeDecodeError:
                pass
            else:
                return unicode_str

        raise DecodeBytesFailedException('decode bytes failed. tried decoders: %s' % list(try_decoders))

    @classmethod
    def encode_to_bytes(cls, unicode_str, encoder='utf-8'):
        """Convert to bytes"""

        if unicode_str is None:
            return b''

        if isinstance(unicode_str, unicode):
            return unicode_str.encode(encoding=encoder)
        else:
            # Byte strings are decoded first so the result always uses `encoder`.
            u = cls.get_unicode_str(unicode_str)
            return u.encode(encoding=encoder)
  • Everything sent to the outside world is converted to UTF-8 encoded bytes, as shown in the code above
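
Taken together, the rules above amount to decoding at the input boundary, working only with Unicode in the middle, and encoding at the output boundary. Here is a minimal sketch of that flow, independent of the UnicodeUtils class above and written to run on both Python 2 and Python 3. The function names decode_input and encode_output are illustrative and do not come from any library:

```python
# -*- coding: utf-8 -*-

def decode_input(raw, encodings=('utf-8', 'gbk')):
    """Decode incoming bytes, trying each candidate encoding in order."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError('could not decode input with %r' % (encodings,))


def encode_output(text, encoding='utf-8'):
    """Encode text to bytes just before it leaves the program."""
    return text.encode(encoding)


# Pretend these bytes arrived from a socket or a file written in GBK.
incoming = u'中国'.encode('gbk')

text = decode_input(incoming)        # bytes -> text at the input boundary
greeting = u'Hello, ' + text         # only text is handled in the middle
outgoing = encode_output(greeting)   # text -> UTF-8 bytes at the output boundary

print(repr(outgoing))
```

The order of the candidate encodings matters: a permissive codec placed first may accept bytes that were meant for a later one, which is why the strict UTF-8 decoder is tried first in this sketch.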