This is the second article in our Q&A series. This time the question did not come from a reader; it is a situation I ran into myself while writing a crawler, so I am recording it here.

1. Chinese garbled characters

Let's skip the theory behind it and go straight to the code!

import requests

url = "http://www.baidu.com"
r = requests.get(url)
print(r.text)
When we print the requested page data, the Chinese characters come out garbled. The solution is as follows:

# Automatically select the appropriate encoding method
r.encoding = r.apparent_encoding
The Chinese text now displays correctly!
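
To see why this fix works, here is a small offline sketch of how the garbling can arise. It assumes the common cause: the server sends UTF-8 bytes, but the client decodes them as ISO-8859-1, which is the encoding requests falls back to when the Content-Type header declares a text type without a charset. The sample string is only an illustration, not the actual response from the site.

```python
# Offline illustration of how the garbling arises (no network request needed).
original = "百度一下，你就知道"  # sample text, chosen only for illustration

# The server sends the text as UTF-8 bytes...
raw_bytes = original.encode("utf-8")

# ...but the client decodes those bytes as ISO-8859-1, producing mojibake.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding the same bytes with the correct codec restores the text.
fixed = raw_bytes.decode("utf-8")

print(repr(garbled))
print(fixed)
```

Setting r.encoding = r.apparent_encoding has the same effect: it replaces the wrongly assumed codec with one guessed from the response bytes themselves.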

Sometimes this operation still does not solve the problem. Sina's home page is one example:

import requests
url = "http://www.sina.com.cn/"
r = requests.get(url)
r.encoding = r.apparent_encoding
If the web page is compressed with gzip, it has to be decompressed before it can be decoded. r.content does the decompression automatically, so we only need to decode the bytes ourselves:

import requests

url = "http://www.sina.com.cn/"
r = requests.get(url)
# Decode the raw bytes with an explicitly specified encoding
html = r.content.decode('UTF-8')
# Alternative: set r.encoding = 'utf-8' and then read r.text
The encoding to specify can usually be found in the charset attribute in the page source.
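
When you only have raw bytes to work with, the two steps above (decompress, then decode with the right charset) can be sketched as follows. The helper name decode_page, the 2 KB search window, and the UTF-8 default are assumptions made for this illustration; requests already decompresses gzip responses for you, so this is only meant to show the idea.

```python
import gzip
import re


def decode_page(body: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: decompress if gzipped, then decode with the declared charset."""
    # gzip streams start with the magic bytes 0x1f 0x8b.
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)

    # Look for a charset declaration near the top of the page source.
    match = re.search(rb"charset=[\"']?([\w-]+)", body[:2048], re.IGNORECASE)
    encoding = match.group(1).decode("ascii") if match else default

    try:
        return body.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back to the default, replacing bad bytes.
        return body.decode(default, errors="replace")


# Simulated gzip-compressed GBK page (made-up content, not Sina's real markup).
page = '<meta charset="gbk"><p>新浪首页</p>'.encode("gbk")
print(decode_page(gzip.compress(page)))
```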

2. HTML entity encoding

In some cases, the page data we request contains a lot of HTML entities.

For example, one of the reviews I got back while crawling TAPTAP game reviews looks like this:

"So far so good, just&hellip;&hellip;<br>to four-star because why is not the same suit can't into community 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 "
As you can see, it contains '&hellip;', the HTML entity for the ellipsis character '…', so the text needs to be processed.

How do we deal with it? Take a look:

In [1]: s = "So far so good, just&hellip;&hellip;<br>to four-star because why is not the same suit can't into community 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 "

In [2]: import html

In [3]: html.unescape(s)
Out[3]: "So far so good, just……<br>to four-star because why is not the same suit can't into community 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 "
As for the string <br>, which is an HTML line break, we can simply replace it with \n.
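
Both cleanup steps can be combined into one small function. This is a minimal sketch: the name clean_review is made up for this example, and the regular expression assumes the line breaks appear as <br>, <br/> or <br /> in any letter case.

```python
import html
import re


def clean_review(text: str) -> str:
    """Hypothetical helper: turn <br> tags into newlines, then decode HTML entities."""
    # Replace <br>, <br/> and <br /> (any letter case) with a newline first,
    # so that an escaped "&lt;br&gt;" in the text is not mistaken for a real tag.
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    # Convert entities such as &hellip; or &amp; into the characters they stand for.
    return html.unescape(text)


print(clean_review("So far so good, just&hellip;&hellip;<br>four stars"))
```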

That's all for this time. There will probably be more garbled-text situations in the future, and we will cover them in a follow-up!

