My crawler:
import urllib.request
import http.cookiejar
import codecs
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')
# head: dict of request headers
def makeMyOpener(head = {
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'
    }):

        cj = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
        header = []
        for key, value in head.items():
            elem = (key, value)
            header.append(elem)
        opener.addheaders = header
        return opener

oper = makeMyOpener()
uop = oper.open('https://www.baidu.com/', timeout = 1000)
data = uop.read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print(data.decode('utf-8'))
   
Today, while writing a crawler in Python, I kept getting an error like this:

UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 0: illegal multibyte sequence

Some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of a UTF-8 encoded file, so we need to strip these bytes ourselves when reading. Python's codecs module defines this constant:

import codecs
# data holds the raw bytes read from the response
# remove the three invisible BOM bytes if they are present
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print(data.decode('utf-8'))
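
The BOM check can also be wrapped in a small helper. This is only a minimal sketch: the name decode_utf8 and the sample bytes are made up for illustration and are not part of the crawler.

```python
import codecs

def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode('utf-8')

# Hypothetical sample: a BOM followed by ordinary UTF-8 text.
sample = codecs.BOM_UTF8 + 'hello'.encode('utf-8')
text = decode_utf8(sample)

# The standard 'utf-8-sig' codec skips a leading BOM by itself.
text_sig = sample.decode('utf-8-sig')
```

Either form returns the text without the BOM; 'utf-8-sig' is the shorter option when the bytes are decoded right away.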
   
After running this, that error was gone, but another one appeared!

UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 0: illegal multibyte sequence

I could not let it go, kept searching online, and finally found the answer in this article:

blog.csdn.net/jim7424994/…

At first I thought Baidu was using GBK encoding, but when I decoded the page as UTF-8, some characters still failed to convert! The real problem is Python's print() function. On Windows 7, print() encodes its output as GBK by default, and GBK cannot represent every character. This problem usually shows up only in CMD. The fix in CMD is to change the encoding of standard output:
import io  
import sys 
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8')
   
With utf8 the Chinese text came out garbled in CMD, so I changed the encoding to gb18030:
import io  
import sys 

# Change the default encoding of standard output
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')
   

—————————— Reprinted content ———————————-

I grabbed some byte streams from the Internet and got the following error when I tried to print them:

UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 8530: illegal multibyte sequence

The code:
import urllib.request  
res=urllib.request.urlopen('http://www.baidu.com')  
htmlBytes=res.read()  
print(htmlBytes.decode('utf-8'))  
   

The error message is very confusing: why does it complain about 'gbk' when I am decoding with 'utf-8'? On top of that, the HTML of the Baidu home page contains the following tag:

<meta http-equiv="content-type" content="text/html; charset=utf-8">  
   

This shows that the page really does use UTF-8, so why the error? In Python 3, there are a few basics to know about encoding:

1. Characters are Unicode characters, and a string (str) is a sequence of Unicode characters.

If you test with the following code,

print('a' == '\u0061')
   

You will find that the result is True, which shows that the two are equivalent.

2. Converting str to bytes is called encoding (encode), and converting bytes to str is called decoding (decode). The code above decodes the captured byte stream into a Unicode string.
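
The two directions can be checked on a single character. This is just an illustrative sketch; the variable names are my own.

```python
text = '\u00bb'              # the character », a str (Unicode)
raw = text.encode('utf-8')   # str -> bytes is encoding
back = raw.decode('utf-8')   # bytes -> str is decoding

same_letter = ('a' == '\u0061')  # an escape is only another way to write the same character
```

Here raw is the two bytes b'\xc2\xbb', and decoding them gives back the original one-character string.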

Based on the error message above, I looked at the place in the byte stream where \xbb appeared and found a special character, » (bytes \xc2\xbb), which I suspected could not be decoded.

So I tested it with the following code:

print(b'\xc2\xbb'.decode('utf-8'))  
   

UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 0: illegal multibyte sequence

I looked up a UTF-8 table online and found that the special character » really is c2bb in UTF-8 and \u00bb in Unicode. So why can't it be decoded?

Looking more closely at the error message: it says 'gbk' cannot encode, whereas I was expecting UTF-8 to fail to decode. These are completely different things, which finally made me suspect the print function. I immediately ran the following test:

print('\u00bb')  
   

UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 0: illegal multibyte sequence

So print() really cannot print every Unicode character here.
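
The same failure can be reproduced without a Windows console by printing to an in-memory stream that uses a chosen encoding. This is only a sketch to illustrate the point; try_print is a hypothetical helper, not something from the original test.

```python
import io

def try_print(text: str, encoding: str) -> bool:
    """Print text to an in-memory text stream using the given encoding.

    Returns True if print() succeeds and False if it raises
    UnicodeEncodeError, the error seen in CMD.
    """
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError:
        return False
    return True

gbk_ok = try_print('\u00bb', 'gbk')      # stream behaves like a GBK console
utf8_ok = try_print('\u00bb', 'utf-8')   # same character, UTF-8 stream
```

The character is identical in both calls; only the encoding of the output stream differs, which is why the error names 'gbk' and not 'utf-8'.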

Print () is the default encoding of Python. The default encoding of Python is not ‘UTF-8’, so change Python’s default encoding to ‘UTF-8’

import io  
import sys  
import urllib.request  
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8')  # Change the default encoding of standard output
res=urllib.request.urlopen('http://www.baidu.com')  
htmlBytes=res.read()  
print(htmlBytes.decode('utf-8'))
   

No more errors, but now there was a lot of garbled text (English displayed normally, Chinese was garbled)! After some fiddling, it turned out to be a console problem: when I ran the script in CMD the output was garbled, but when I ran it in IDLE it was fine.

From this I infer that CMD is not very compatible with UTF-8, whereas IDLE is. Under IDLE the script even runs without changing the default encoding of standard output, because there it is already utf8. If you must run in CMD, change the encoding to something else, such as 'gb18030', and it will work:

sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')  # Change the default encoding of standard output
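
io.TextIOWrapper also accepts an errors argument, so another option (my own suggestion, not from the referenced article) is to keep the console's encoding and replace whatever it cannot represent. The sketch below uses an in-memory buffer in place of sys.stdout.buffer so the effect is visible; encode_for_console is a hypothetical name.

```python
import io

def encode_for_console(text: str, encoding: str = 'gbk',
                       errors: str = 'replace') -> bytes:
    """Return the bytes a text stream with this encoding would emit.

    With errors='replace', characters the encoding cannot represent
    are written as '?' and no UnicodeEncodeError is raised.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()

replaced = encode_for_console('a\u00bbb')  # » has no GBK encoding
```

The trade-off is that unsupported characters are lost in the output, whereas gb18030 keeps them.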
   

Finally, here are the names of some commonly used Chinese-related encodings; assign them to encoding to see the different effects:

Codec name    Use
utf8          All languages
gbk           Simplified Chinese
gb2312        Simplified Chinese
gb18030       Simplified Chinese
big5          Traditional Chinese
big5hkscs     Traditional Chinese
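
To compare these codecs without touching the console, you can try encoding a sample string with each one. A minimal sketch follows; the helper name and the choice of » as the sample are mine.

```python
CODECS = ['utf8', 'gbk', 'gb2312', 'gb18030', 'big5', 'big5hkscs']

def encodable_with(text: str) -> dict:
    """Map each codec name to True if it can encode text, else False."""
    results = {}
    for name in CODECS:
        try:
            text.encode(name)
            results[name] = True
        except UnicodeEncodeError:
            results[name] = False
    return results

# The character from the error message above.
results = encodable_with('\u00bb')
for name, ok in results.items():
    print(name, 'ok' if ok else 'cannot encode')
```

For » the gbk entry comes out as a failure, matching the error above, while utf8 and gb18030 succeed.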


Reference: blog.csdn.net/jim7424994/…
