Behind every successful programmer there are probably a number of balding developers, supplying them with all kinds of development tools and code libraries, including… all kinds of inexplicable bugs…

Where the mystery begins

I recently ran into a strange problem in a Python crawler project. It doesn’t happen on every run, which makes it really confusing…

The following error message is displayed:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026' in position 512: ordinal not in range(256)

I searched for the error message. Posts in the Chinese community say it happens when the request body or headers contain Chinese characters.

The suggested fix is to encode the value as UTF-8 and then decode it as Latin-1.

But there is no Chinese in the data I requested!
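
Chinese text is not the only possible trigger, though. The latin-1 codec rejects any character above U+00FF, and the '\u2026' named in the error message is the horizontal ellipsis "…". Here is a minimal sketch for locating such characters, assuming the headers are a plain dict of strings. The helper name find_non_latin1 and the sample values are hypothetical and not taken from the project:

```python
import unicodedata


def find_non_latin1(headers):
    """Return (header name, position, character, Unicode name) for every
    character that the latin-1 codec cannot encode."""
    problems = []
    for name, value in headers.items():
        for position, char in enumerate(value):
            if ord(char) > 0xFF:
                problems.append(
                    (name, position, char, unicodedata.name(char, "UNKNOWN"))
                )
    return problems


# Hypothetical sample: the cookie value contains an ellipsis character.
sample_headers = {
    "Accept": "application/json",
    "Cookie": "oauth=abc\u2026",
}

for problem in find_non_latin1(sample_headers):
    print(problem)
```

Running something like this over the real headers dict just before the request is sent would show whether any field contains a character outside the Latin-1 range, and which one.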

And so began a head-scratching bug hunt.

Look at the code

(The code is long and ugly; feel free to skip ahead to the follow-up.)

First, the format of the config data I use.

It holds the authentication credentials (the values are long, so only part of each is shown):

"spider9": {
  "Authorization": "Basic Z2VjZW50ZXJfYWR",
  "Blade-Auth": "bearer eyJhbGciOiJIUzI1NiIsI",
  "cookie": "oauth=eyJhY2Nlc3NfdG"
}

Here is the code that builds the headers:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate',
    'Content-Type': 'application/json; charset=utf-8',
    'Authorization': config['spider9']['Authorization'],
    'Blade-Auth': config['spider9']['Blade-Auth'].encode('utf-8').decode('latin1'),
    'Content-Length': '60',
    'Connection': 'keep-alive',
    'Cookie': config['spider9']['cookie'],
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

And here is the Requests call:

response = requests.post(
    url, headers=headers, verify=False,
    json={
        'cityCode': '1234',
        'createTimeFrom': None,
        'createTimeTo': None,
    })

At a glance this code looks perfectly fine. In fact, my other crawlers are written the same way and have been running stably for more than a year; only this recent one misbehaves… Sometimes code problems really are that mysterious…

Going back to the encode-then-decode approach I had found online, I tried applying it to the Authorization, Blade-Auth and Cookie fields in headers:

'Authorization': config['spider9']['Authorization'].encode('utf-8').decode('latin1')

The UnicodeEncodeError was no longer raised, but the backend server responded that I was not logged in…
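
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the round trip does not remove the offending character. It only re-expresses it as the individual bytes of its UTF-8 encoding, so the server would receive a value that differs from the token it issued. A minimal sketch with a hypothetical token value (the helper name utf8_as_latin1 is made up for illustration):

```python
def utf8_as_latin1(value):
    """Apply the workaround: each UTF-8 byte becomes one character <= U+00FF."""
    return value.encode("utf-8").decode("latin-1")


# Hypothetical token containing an ellipsis (U+2026).
token = "bearer abc\u2026"
patched = utf8_as_latin1(token)

# The patched string can now be encoded as latin-1 without an error...
wire_bytes = patched.encode("latin-1")

# ...but the single ellipsis has become three bytes, so the value has changed.
print(len(token), len(patched))
print(wire_bytes)
```

For a pure-ASCII value the round trip changes nothing, so it only makes a difference when a header contains something unexpected.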

So what’s the problem?

I kept searching and ended up on Stack Overflow, where the suggestion was to set an environment variable:

export PYTHONUTF8=1

Then print the filesystem encoding and the locale’s preferred encoding in Python:

import sys
import locale

print(sys.getfilesystemencoding())
print(locale.getpreferredencoding())

The output

utf-8
UTF-8

Oh well, let’s run it again… Still failing. Seriously? So that wasn’t the problem either.
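
A likely reason the setting has no effect: PYTHONUTF8 changes Python’s default encodings (for example for open() and the standard streams), whereas the error names the latin-1 codec explicitly. As far as I can tell, the standard library’s http.client encodes header values with an explicit latin-1 call, which does not consult any default. A small sketch showing that an explicit encode fails regardless of UTF-8 mode (can_encode_latin1 is a hypothetical helper):

```python
import sys


def can_encode_latin1(text):
    """Return True if an explicit latin-1 encode of text succeeds."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


# sys.flags.utf8_mode is 1 when PYTHONUTF8=1 (or -X utf8) is active.
print("UTF-8 mode:", sys.flags.utf8_mode)
print(can_encode_latin1("keep-alive"))
print(can_encode_latin1("\u2026"))
```

The last call reports a failure with or without the environment variable, because the codec is chosen by name and not taken from the environment.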

All right, I give up. Wherever you fall, just lie down there.

So what’s the problem? It’s still a mystery…

Follow-up

I was worn out, so I ended up writing this crawler in C# instead of Python…

Whatever happened to “Life is short, I use Python”? How did it get this messy? TAT…

(PS: I previously built a crawler platform that can schedule crawlers implemented in different languages and provides a unified configuration center and a unified data persistence interface, so it makes little difference which language each crawler is written in.)

I am writing this post as a record, in the hope that one day this problem will be solved.

References

  • Stackoverflow.com/questions/6…
  • www.cnblogs.com/xtmp/p/1269…