Background

Python has long been popular as one of the most elegant languages. It can take repetitive tasks off your hands, such as collecting all kinds of data or importing and exporting the tables your boss needs. All you have to do is write a Python script, run it, drink a cup of tea and wait for the results.

This article came about when I ran into a Python 2.x character encoding problem while using Scrapy.

Preliminary knowledge

ASCII

First, we need to make it clear that all information in a computer is ultimately represented in binary. Each bit has two states: 0 and 1. Eight bits, called a byte, can represent 256 states.

As a pioneer in the computer field, the United States developed a character code in the 1960s that used binary to represent English letters, numbers and symbols. This character code is ASCII. It defines 128 characters, so each one fits comfortably in a single byte.
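As a small illustration of the idea, the sketch below looks up the ASCII number of a few characters and shows the byte each one occupies. The helper name `ascii_code` is made up for this example, and the source is kept ASCII-only so it should run on both Python 2.7 and Python 3.

```python
# ASCII assigns each character a number from 0 to 127, which fits in 7 bits.
def ascii_code(ch):
    """Return the ASCII code of a single character, or None if it has none."""
    code = ord(ch)
    return code if code < 128 else None


codes = [ascii_code(c) for c in u"Hi!"]       # one number per character
binary = [format(c, "08b") for c in codes]    # the same numbers as 8-bit patterns
no_code = ascii_code(u"\u5317")               # a Chinese character: not in ASCII
states_per_byte = 2 ** 8                      # eight bits give 256 states

print(codes)
print(binary)
```

A character outside the 0 to 127 range simply has no ASCII code, which is the limitation the next section is about.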

Unicode

As computers spread, other languages soon needed encodings of their own, and single-byte ASCII could not represent enough characters. To avoid conflicting with ASCII, countries began designing their own encodings that use more bytes per character; China, for example, developed GB2312.

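However, the growing number of character encoding schemes brought a series of problems of its own, since the same bytes mean different things under different schemes. Hence Unicode: it uses a single character set to represent the characters of all languages, which resolves the conflicts between those schemes. Unicode itself only assigns each character a number; encodings such as UTF-8 define how those numbers are stored as bytes.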
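To make the problem concrete, here is a minimal sketch that encodes the same two Chinese characters with GB2312 and with UTF-8. The helper `try_decode` is hypothetical, and the text is written with `\u` escapes so the example itself stays ASCII-only.

```python
text = u"\u5317\u4eac"   # the two Chinese characters for "Beijing"

gb_bytes = text.encode("gb2312")    # 2 bytes per character
utf8_bytes = text.encode("utf-8")   # 3 bytes per character here


def try_decode(data, encoding):
    """Decode bytes, returning None when they are not valid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


same_scheme = try_decode(gb_bytes, "gb2312")   # decodes back to the text
wrong_scheme = try_decode(gb_bytes, "utf-8")   # GB2312 bytes are not valid UTF-8

# Both schemes keep plain ASCII text byte-for-byte identical.
ascii_compatible = u"abc".encode("gb2312") == u"abc".encode("utf-8") == b"abc"
```

The bytes only make sense when decoded with the scheme that produced them, which is why mixing schemes leads to garbled text or errors.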

Let’s talk about Python 2.x Chinese encoding

First, a simple and fairly common example:

The contents of the test.py file:

#! /usr/bin/python
city = '北京'
print city

Run python test.py and Python 2 refuses to run the file: it reports a SyntaxError about a non-ASCII character in test.py with no encoding declared.

The solution is to declare the encoding of test.py at the top of the file, because Python 2 assumes ASCII source by default:

#! /usr/bin/python
# -*- coding: utf-8 -*-
city = '北京'
print city
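The encoding declaration is an ordinary comment that the interpreter looks for on the first or second line of the file. The sketch below imitates that lookup with the pattern given in PEP 263. The function `declared_encoding` and its `default` parameter are assumptions made for illustration; this is a simplification, not the interpreter's actual implementation.

```python
import re

# Pattern taken from PEP 263; the declaration must sit on line 1 or 2.
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_bytes, default="ascii"):
    """Return the encoding named in the first two lines, else the default.

    The default of "ascii" mirrors Python 2; Python 3 assumes UTF-8 instead.
    """
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default


without_decl = b"#! /usr/bin/python\ncity = 'x'\n"
with_decl = b"#! /usr/bin/python\n# -*- coding: utf-8 -*-\ncity = 'x'\n"

print(declared_encoding(without_decl))
print(declared_encoding(with_decl))
```

Because only the first two lines are checked, a declaration placed further down the file has no effect.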

Now that you have seen the example, let's look at str and unicode

str and unicode are both subclasses of basestring. Strictly speaking, however, unicode is the real text string, while str is a byte string: the sequence of bytes produced by encoding a unicode string. In the last screenshot, to clear up any confusion, s == s2
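A minimal sketch of that relationship, assuming UTF-8 as the encoding: encoding turns text into bytes, and decoding with the same encoding turns the bytes back into the same text. The variable names are chosen only for this example.

```python
text = u"\u5317\u4eac"               # unicode: the two characters for "Beijing"
data = text.encode("utf-8")          # byte string: those characters as UTF-8 bytes

char_count = len(text)               # counts characters
byte_count = len(data)               # counts bytes
round_trip = data.decode("utf-8")    # bytes back to text

print(repr(text))
print(repr(data))
```

The two lengths differ because `len` counts characters for text and bytes for a byte string.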

Once you understand the difference between str and unicode, converting in either direction (unicode to str or str to unicode) no longer has to produce garbled characters or errors. Just avoid the following operations:

  • 1. Do not call decode on a unicode string that contains Chinese
  • 2. Do not call encode on a str that contains Chinese

With that said, show me the code. An example:

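#! /usr/bin/python
# -*- coding: utf-8 -*-
u_z = u'北京'   # unicode containing Chinese
s_z = '北京'    # str (UTF-8 bytes) containing Chinese
u=u'jing'       # ASCII-only unicode
s='jing'        # ASCII-only str

print u_z
print s_z
print repr(u_z)
print repr(s_z)

print '= = = = = = = = = = = = = = = = = ='

print u
print s
print repr(u)
print repr(s)

print '= = = = = = = = = = = = = = = = = ='

# ASCII-only values survive the implicit ASCII conversion
print u.decode()
print s.encode()

# Chinese values do not: each call raises, so catch and show the error
try:
    print u_z.decode()
except UnicodeEncodeError as e:
    print repr(e)
try:
    print s_z.encode()
except UnicodeDecodeError as e:
    print repr(e)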
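The reason the last two calls in the script above fail is an implicit conversion: in Python 2, unicode.decode() first encodes the text with the default encoding (ASCII), and str.encode() first decodes the bytes with it. The sketch below spells that hidden step out. The helper names are hypothetical, and the code is an approximation of the behaviour that should run on both Python 2.7 and Python 3.

```python
def decode_unicode_like_py2(text, default="ascii"):
    """Roughly what Python 2 does for unicode.decode(): encode first, then decode."""
    return text.encode(default).decode(default)


def encode_str_like_py2(data, default="ascii"):
    """Roughly what Python 2 does for str.encode(): decode first, then encode."""
    return data.decode(default).encode(default)


def outcome(func, value):
    """Return 'ok', or the name of the Unicode error the call raised."""
    try:
        func(value)
        return "ok"
    except UnicodeError as exc:
        return type(exc).__name__


results = [
    outcome(decode_unicode_like_py2, u"jing"),                      # ASCII text
    outcome(decode_unicode_like_py2, u"\u5317\u4eac"),              # Chinese text
    outcome(encode_str_like_py2, b"jing"),                          # ASCII bytes
    outcome(encode_str_like_py2, b"\xe5\x8c\x97\xe4\xba\xac"),      # Chinese bytes
]
print(results)
```

This is also why the error names look backwards: decoding a unicode string fails with an encode error, and encoding a str fails with a decode error.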

A little advice

  • Within one project, use the same character encoding for all source files and for the system default, to prevent garbled text
    #! /usr/bin/python
    # -*- coding:utf8 -*-
    import sys
    # Python 2 only: site.py removes setdefaultencoding at startup,
    # and reload(sys) brings it back
    reload(sys)
    sys.setdefaultencoding("utf-8")
  • Use unicode whenever possible
  • Serialize arrays and dictionaries with json.dumps before storing them in the database
  • Use the same encoding for decode and encode
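The last three points can be sketched together: decode bytes to unicode as they come in, work with unicode internally, and serialize with json.dumps before storing. The names `ENCODING`, `to_text` and `to_bytes` are assumptions made for this example, not part of any library.

```python
import json

ENCODING = "utf-8"  # assumed project-wide encoding, used for decode and encode alike


def to_text(value, encoding=ENCODING):
    """Decode byte strings on the way in; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding=ENCODING):
    """Encode text on the way out; pass byte strings through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


raw = b"\xe5\x8c\x97\xe4\xba\xac"        # UTF-8 bytes, e.g. from a scraped page
record = {u"city": to_text(raw)}         # unicode inside the program

stored = json.dumps(record)              # ASCII-only JSON text by default
loaded = json.loads(stored)              # back to a dict of unicode values

print(stored)
```

With the default settings, json.dumps escapes non-ASCII characters, so the stored value does not depend on the database's encoding.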

OK, that is my understanding of Chinese character encoding in Python 2.x. I hope it helps.

Not to be reproduced without my permission. If you find omissions or shallow spots in this article, corrections are welcome.