If you are new to processing Chinese text in Python, character encodings and the errors they cause can be a headache.

If, like me, you want to get things done as quickly as possible without digging too deeply, you might write some exception handling to make UnicodeEncodeError go away, only to start wondering how many encoding errors you have been silently throwing away. So eventually you have to sit down, face the problem, and figure out what UTF-8, GB2312, and GBK are and how they relate. Just as it feels good to rip a band-aid off a small cut, once you get to grips with a problem you have been avoiding (and often it does not even take much courage), you find it was "just that simple" and can solve it once and for all.

The most succinct tutorial I could find on handling Unicode in Python is:

Unicode In Python, Completely Demystified

Here is a brief list of its most important and useful points:

Solution

  1. Decode early (decode as soon as possible: convert a file's contents to Unicode before any further processing)
  2. Unicode everywhere (do all internal processing in Unicode)
  3. Encode late (encode to the desired encoding only at the very end, for example when writing the final result to a file)
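The three steps above can be sketched as a single function. This is a minimal illustration assuming Python 3, where `bytes` plays the role of Python 2's `str` and `str` plays the role of `unicode`; the function name `transcode` and its `transform` parameter are hypothetical and not part of the tutorial.

```python
def transcode(raw_bytes, source_encoding, target_encoding, transform=None):
    """Hypothetical helper illustrating decode early / Unicode everywhere / encode late."""
    # 1. Decode early: turn incoming bytes into text right at the boundary.
    text = raw_bytes.decode(source_encoding)

    # 2. Unicode everywhere: all processing happens on text, never on bytes.
    if transform is not None:
        text = transform(text)

    # 3. Encode late: go back to bytes only when the data leaves the program.
    return text.encode(target_encoding)


# Example: input arrives as GBK, output is wanted as UTF-8.
gbk_input = "  中文  ".encode("gbk")
utf8_output = transcode(gbk_input, "gbk", "utf-8", transform=str.strip)
```

Keeping the decode and encode calls at the edges means the processing step in the middle never needs to know which encoding the input or output uses.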

1. Decode early

Decode to <type 'unicode'> ASAP

>>> def to_unicode_or_bust(
...         obj, encoding='utf-8'):
...     if isinstance(obj, basestring):
...         if not isinstance(obj, unicode):
...             obj = unicode(obj, encoding)
...     return obj
...

This checks whether the object is a string and, if it is not already unicode, converts it to unicode. Non-string objects are returned unchanged.
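The function above relies on `basestring` and `unicode`, which exist only in Python 2. A rough Python 3 counterpart might look like the sketch below; the name `to_text_or_bust` is hypothetical, and the mapping (Python 2 `str` to `bytes`, `unicode` to `str`) is my own adaptation rather than something taken from the tutorial.

```python
def to_text_or_bust(obj, encoding="utf-8"):
    """Hypothetical Python 3 counterpart: decode bytes to str, pass everything else through."""
    if isinstance(obj, bytes):
        # Only byte strings need decoding; str is already Unicode text in Python 3.
        obj = obj.decode(encoding)
    return obj


ivan_text = "Ivan Krsti\u0107"          # already text
ivan_utf8 = ivan_text.encode("utf-8")   # the same name as UTF-8 bytes

from_text = to_text_or_bust(ivan_text)
from_bytes = to_text_or_bust(ivan_utf8)
from_number = to_text_or_bust(1234)
```

As with the original, a wrong `encoding` argument still raises `UnicodeDecodeError`, which is the "or bust" part.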

2. Unicode everywhere

>>> to_unicode_or_bust(ivan_uni)
u'Ivan Krsti\u0107'
>>> to_unicode_or_bust(ivan_utf8)
u'Ivan Krsti\u0107'
>>> to_unicode_or_bust(1234)
1234

3. Encode late

Encode to <type 'str'> when you write to disk or print

>>> f = open('/tmp/ivan_out.txt', 'w')
>>> f.write(ivan_uni.encode('utf-8'))
>>> f.close()
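In Python 3 the same "encode late" step is usually delegated to `open()` through its `encoding` argument, so the program only ever writes text. The sketch below assumes Python 3 and writes to a temporary directory instead of `/tmp/ivan_out.txt`; the variable names are illustrative.

```python
import os
import tempfile

ivan_uni = "Ivan Krsti\u0107"

# Write into a fresh temporary directory so the example does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "ivan_out.txt")

# The file object encodes text to UTF-8 bytes at the moment of writing.
with open(out_path, "w", encoding="utf-8") as f:
    f.write(ivan_uni)

# Reading back in binary mode shows the bytes that actually reached the disk.
with open(out_path, "rb") as f:
    raw = f.read()
```

Passing `encoding` explicitly avoids depending on the platform's default encoding, which differs between systems.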

I used to think Unicode-related processing was a dirty job and tried to patch over it with exception handling, but after going through this tutorial it suddenly all made sense.

I hope you too can untangle Chinese text handling as soon as possible and face the "mysterious" Unicode head-on.

Author: increasingly, blog: https://coolshell.cn/articles/18190.html