Hello, everyone, I am the talented brother.

This is the second article in our Q&A series. This time the question did not come from a fan; it is a situation I ran into myself while writing a crawler, so I am recording it here today.

1. Chinese garbled characters

Let's skip the underlying theory for now and go straight to the code!

import requests

url = "http://www.baidu.com"
r = requests.get(url)
print(r.text)

When we print the returned page data, the Chinese characters come out garbled. The fix is as follows:

# Automatically select the appropriate encoding method
r.encoding = r.apparent_encoding

The Chinese text now displays correctly!
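For readers curious about what is going on behind this fix, here is a small offline sketch (no network request; the sample string is only an illustration, not Baidu's real response). As far as I know, requests falls back to ISO-8859-1 for r.text when the response header declares a text type without a charset, while r.apparent_encoding is a guess made from the bytes themselves. Decoding UTF-8 bytes as ISO-8859-1 produces exactly this kind of garbling:

```python
# Offline sketch: reproduce the garbling without any network request.
original = "百度一下，你就知道"        # sample text, assumed for illustration
raw_bytes = original.encode("utf-8")   # the bytes a server would send

# Decoding UTF-8 bytes with the wrong codec produces garbled text.
garbled = raw_bytes.decode("iso-8859-1")

# Undo the wrong decoding, then decode with the right codec.
restored = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled)
print(restored)
```

Setting r.encoding before reading r.text has the same effect as the last step: it tells requests which codec to use for the raw bytes.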

Sometimes setting r.encoding this way still does not fix the problem. Sina's home page is one example:

import requests
url = "http://www.sina.com.cn/"
r = requests.get(url)
r.encoding = r.apparent_encoding

This happens because the page is gzip-compressed, so the body has to be decoded. r.content returns the bytes with the compression already removed, and we can then decode them with an explicit encoding:

import requests

url = "http://www.sina.com.cn/"
r = requests.get(url)
# Decode the raw bytes with an explicitly specified encoding
html = r.content.decode('UTF-8')
# Alternative: set r.encoding = 'utf-8' and then read r.text
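To make the two layers (compression and text encoding) easier to see, here is a minimal offline sketch. It simulates a gzip-compressed response body with the standard gzip module; the sample text and variable names are assumptions for illustration. In a real request, requests normally performs the decompression step for you, so r.content corresponds to raw_bytes below:

```python
import gzip

page_text = "新浪首页 - Sina home page"   # sample page text, assumed for illustration

# What travels over the network: UTF-8 bytes, then gzip-compressed.
body_on_wire = gzip.compress(page_text.encode("utf-8"))

# Step 1: remove the compression (requests does this before filling r.content).
raw_bytes = gzip.decompress(body_on_wire)

# Step 2: turn bytes into text with the correct encoding.
html_text = raw_bytes.decode("utf-8")

print(html_text)
```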

The encoding to use here is the one the page declares in its charset, which you can check in the page source.
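When the right encoding is not known in advance, one option is to read the charset that the page itself declares. The sketch below is only one possible approach, not part of requests: sniff_charset and decode_page are hypothetical helper names, and the regular expression only handles the common meta-tag forms.

```python
import re

# Matches charset=... in <meta charset="utf-8"> or in
# content="text/html; charset=gb2312" (hypothetical helper pattern).
META_CHARSET = re.compile(
    rb"""charset\s*=\s*["']?\s*([A-Za-z0-9_\-]+)""", re.IGNORECASE
)

def sniff_charset(raw_bytes, default="utf-8"):
    """Return the charset declared near the top of the page, or `default`."""
    match = META_CHARSET.search(raw_bytes[:2048])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()

def decode_page(raw_bytes, default="utf-8"):
    """Decode page bytes using the declared charset, falling back to `default`."""
    charset = sniff_charset(raw_bytes, default)
    try:
        return raw_bytes.decode(charset, errors="replace")
    except LookupError:
        # The page declared a codec name that Python does not know.
        return raw_bytes.decode(default, errors="replace")

sample = '<html><head><meta charset="gbk"><title>新浪首页</title></head></html>'.encode("gbk")
print(sniff_charset(sample))
print(decode_page(sample))
```

With a real response, the same helper could be called as decode_page(r.content).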

2. HTML entity encoding

In some cases, the web page data we request contains a lot of HTML entities.

For example, one of the reviews I fetched while crawling TAPTAP game reviews looks like this:

"So far so good, just &hellip;…<br>to four-star because why is not the same suit can't into community 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 "

As you can see, it contains '&hellip;', the HTML entity for the ellipsis character '…', so the text needs further processing!

How do we handle that? Take a look!

In [1]: s = "So far so good, just &hellip;…<br>to four-star because why is not the same suit can't into community 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 "

In [2]: import html

In [3]: html.unescape(s)
Out[3]: "So far so good, just ……<br>to four-star because why is not the same suit can't into community 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 "

As for the string '<br>', which is an HTML line break, we can simply replace it with \n.
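Both steps can be combined in a small helper. This is a minimal sketch: clean_review is a hypothetical name, the sample string is made up for illustration, and the regular expression is assumed to cover the usual spellings of the tag (<br>, <br/>, <br />). The tags are replaced before unescaping so that an escaped '&lt;br&gt;' written literally by a user is not turned into a line break.

```python
import html
import re

# Matches <br>, <br/>, <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)

def clean_review(text):
    """Turn <br> tags into newlines, then decode HTML entities."""
    text = BR_TAG.sub("\n", text)
    return html.unescape(text)

sample = "So far so good, just&hellip;<br>four stars for now"
print(clean_review(sample))
```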

That's all for this time. If I run into more garbled-text situations in the future, I will cover them in a follow-up!

If you have run into similar problems, feel free to get in touch!