
The requests module has many alternatives, such as the standard library's urllib, but requests is the one used most in practice. Its code is simple and easy to understand compared with the clunky urllib module: a crawler written with requests needs less code, and each feature is simpler to implement. It is therefore recommended that you master this module.

The requests module

Let's learn how to implement our crawler in code.

1. Requests module introduction

requests documentation: http://docs.python-requests.org/zh_CN/latest/index.html

**1.1 What the requests module is used for:**

Sending HTTP requests and getting the response data.

**1.2 requests is a third-party module** and must be installed separately in your Python (virtual) environment:

```
pip/pip3 install requests
```
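To confirm the installation worked, a quick sanity check (not part of the original tutorial) is to import the module and print its version:

```python
import requests

# if this runs without an ImportError, requests is installed
print(requests.__version__)
```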

**1.3 Sending GET requests with the requests module**

Requirement: use requests to send a request to the Baidu home page and fetch the page's source code. Run the code below and observe the printed output.

1.2.1 - Simple code implementation

```python
import requests

# target url
url = 'https://www.baidu.com'

# send a GET request to the target url
response = requests.get(url)

# print the response content
print(response.text)
```

Key point: master sending GET requests with the requests module.

2. The response object

If you observe the output of the code above, you will find a lot of garbled text. This happens because different character sets were used for encoding and decoding. We can solve the garbled-Chinese problem as follows:

```python
# 1.2.2 - response.content
import requests

# target url
url = 'https://www.baidu.com'

# send a GET request to the target url
response = requests.get(url)

# print the response content
# print(response.text)
print(response.content.decode())
```

1. `response.text` is the result of the requests module decoding the response with the character set inferred by the chardet module.
2. `response.text = response.content.decode('inferred character set')`
3. We can search for `charset` in the page source and use it as a reference character set, but note that it is not always accurate.
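As a small illustration of the inference above (a sketch, not from the original), requests exposes both the header-based guess and the content-based guess, so you can compare them and decode manually:

```python
import requests

response = requests.get('https://www.baidu.com')

# encoding inferred from the HTTP headers; this is what response.text uses
print(response.encoding)

# encoding inferred from the response body (via chardet)
print(response.apparent_encoding)

# decode manually with the body-based guess, which is often more reliable
print(response.content.decode(response.apparent_encoding))
```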

**2.1 The difference between response.text and response.content:**

* `response.text`
    * type: `str`
    * decoding: the requests module automatically makes an educated guess about the response's encoding based on the HTTP headers, then decodes with that inferred encoding
* `response.content`
    * type: `bytes`
    * decoding: not specified
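A minimal sketch of the difference (not from the original): the same response gives an already-decoded string through one attribute and raw bytes through the other.

```python
import requests

response = requests.get('https://www.baidu.com')

print(type(response.text))     # <class 'str'>: already decoded with the inferred encoding
print(type(response.content))  # <class 'bytes'>: raw bytes that you decode yourself
```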
**2.2 Decoding response.content to fix garbled Chinese characters**

* `response.content.decode()` defaults to UTF-8
* `response.content.decode("GBK")`
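If you are unsure which character set a page uses, one defensive approach (a sketch, not from the original; the URL is only an example) is to try one encoding and fall back with replacement characters rather than crash:

```python
import requests

# example URL; substitute a page you suspect is GBK-encoded
response = requests.get('https://www.baidu.com')

try:
    text = response.content.decode('gbk')
except UnicodeDecodeError:
    # fall back to UTF-8 and replace undecodable bytes instead of raising
    text = response.content.decode('utf-8', errors='replace')

print(text[:200])
```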

Common character encodings:

* UTF-8
* GBK
* GB2312
* ASCII
* ISO-8859-1

**2.3 Other common attributes and methods of the response object**

> response = requests.get(url); response is the object returned by sending the request. Besides text and content, the response object has other attributes and methods that are commonly used to get the response content:

* `response.url`: the URL of the response; sometimes the response URL does not match the request URL
* `response.status_code`: the response status code
* `response.request.headers`: the request headers of the corresponding request
* `response.headers`: the response headers
* `response.request._cookies`: the cookies carried by the request; returns a cookieJar object
* `response.cookies`: the cookies carried in the response (after the Set-Cookie action); returns a cookieJar object
* `response.json()`: automatically converts a JSON string response into a Python object (dict or list)
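For example, response.json() pays off when requesting a JSON API. A minimal sketch, using httpbin.org as a stand-in endpoint (it is not mentioned in the original):

```python
import requests

# httpbin.org echoes request data back as JSON
response = requests.get('https://httpbin.org/get')

data = response.json()  # parse the JSON body into a Python object
print(type(data))       # <class 'dict'>
print(data['url'])      # https://httpbin.org/get
```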

```python
import requests

# target url
url = 'https://www.baidu.com'

# send a GET request to the target url
response = requests.get(url)

# print the response content
# print(response.text)
# print(response.content.decode())

# print other attributes of the response: notice the output of each one!
print(response.status_code)       # the response status code
print(response.request.headers)   # the request headers of the corresponding request
print(response.headers)           # the response headers
print(response.request._cookies)  # the cookies carried by the request
print(response.cookies)           # the cookies carried in the response
```

3. Sending requests with the requests module

**3.1 Sending a request with headers**

First, let's write code that fetches Baidu's home page:

```python
import requests

url = 'https://www.baidu.com'

response = requests.get(url)

print(response.content.decode())

# print the request headers of the corresponding request
print(response.request.headers)
```

**3.1.1 Think about it**

Compare the Baidu home page source shown in the browser with the source fetched by our code. What is different?

How to view the page source: right-click > View Page Source, or right-click > Inspect.

Compare the response content of the corresponding URL with the Baidu home page source fetched by our code. What is different?

How to view the response content of the corresponding URL: right-click > Inspect, click Network, check Preserve log, refresh the page, and view the Response of the entry under the Name column whose URL matches the browser address bar.

The Baidu home page source fetched by our code is very short. Why?

Recall the concept of a crawler: we need to carry request-header information to imitate a browser and trick the server into returning the same content the browser gets. Of the many fields in the request header, User-Agent is essential; it identifies the client's operating system and browser.
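To see the effect concretely, here is a sketch (not from the original, and it assumes Baidu still serves different content to non-browser clients) that compares the response sizes with and without a browser-like User-Agent:

```python
import requests

url = 'https://www.baidu.com'

# default headers: the server sees a python-requests User-Agent and may return a reduced page
plain = requests.get(url)

# browser-like User-Agent copied from a real browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
disguised = requests.get(url, headers=headers)

# the second response is usually much larger
print(len(plain.content), len(disguised.content))
```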
**3.1.2 How to send a request carrying request headers**

`requests.get(url, headers=headers)`

* the `headers` parameter receives the request headers in dictionary form
* the request-header field names are the keys, and the corresponding values are the values

**3.1.3 Complete code implementation**

Copy the User-Agent from the browser and build the headers dictionary. Complete the code below, run it, and observe the output.

```python
import requests

url = 'https://www.baidu.com'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# carry the User-Agent in the request headers to imitate a browser
response = requests.get(url, headers=headers)

print(response.content.decode())

# print the request headers
print(response.request.headers)
```

**3.2 Sending a request with parameters**

When we use Baidu search, we often notice a `?` in the URL address. The part after the question mark is the request parameters, also called the query string.

**3.2.1 Carrying parameters in the URL**

Send a request directly to a URL that already contains the parameters:

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

url = 'https://www.baidu.com/s?wd=python'

response = requests.get(url, headers=headers)
```

**3.2.2 Carrying a parameter dictionary through params**

1. Build the request-parameter dictionary.
2. Pass the dictionary as the `params` argument when sending the request.

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# this is the target url
# url = 'https://www.baidu.com/s?wd=python'

# with or without the trailing question mark, the result is the same
url = 'https://www.baidu.com/s?'

# the request parameters form a dictionary, i.e. wd=python
kw = {'wd': 'python'}

# send the request with the params dictionary
response = requests.get(url, headers=headers, params=kw)

print(response.content)
```

**3.3 Carrying cookies in the headers parameter**

We can add a Cookie to the headers parameter to simulate the request of a normal (logged-in) user. Let's take GitHub login as an example.

**3.3.1 GitHub login packet-capture analysis**

1. Open a browser, right-click > Inspect, click Network, and check Preserve log.
2. Visit the GitHub login URL https://github.com/login, enter your account and password, and click login.
3. Visit a URL whose correct content requires login, for example click Your profile in the upper-right corner to visit https://github.com/USER_NAME.
4. After determining the URL, determine the User-Agent and Cookie that the request headers must carry.

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/7ea42232737e4616989dad69526f4aa5~tplv-k3u1fbpfcp-zoom-1.image)

**3.3.2 Complete code implementation**

The request-header fields and values in the headers parameter must match those in the browser; the value of the Cookie key in the headers dictionary is a string.

```python
import requests

url = 'https://github.com/USER_NAME'

# construct the request headers dictionary
headers = {
    # User-Agent copied from the browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    # Cookie copied from the browser
    'Cookie': 'xxx: the copied cookie string goes here',
}

# the request headers dictionary carries the cookie string
resp = requests.get(url, headers=headers)

print(resp.text)
```

**3.3.3 Run the code and verify the result**

If the account shown on the fetched page is your GitHub account, the headers parameter has successfully carried the cookie and retrieved a page that can only be accessed after logging in.

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/59022d4951254bccb574a15328ecda12~tplv-k3u1fbpfcp-zoom-1.image)

Key point: understand how to carry cookies in the headers parameter.

**3.4 The cookies parameter**

In the previous section we carried the cookie in the headers parameter; we can also use the dedicated cookies parameter.

* The form of the cookies parameter: a dictionary

  `cookies = {"cookie name": "cookie value"}`

  The dictionary corresponds to the Cookie string in the request headers: key-value pairs are separated by a semicolon and a space, the left side of each equals sign is a cookie's name (the dictionary key), and the right side is the cookie's value (the dictionary value).

* How to use the cookies parameter:

  `response = requests.get(url, cookies=cookies)`

* Converting a cookie string into the dictionary needed for the cookies parameter:

  `cookies_dict = {cookie.split('=')[0]: cookie.split('=')[-1] for cookie in cookies_str.split('; ')}`

* Note: cookies usually have an expiration time; once expired, they must be obtained again.
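For clarity, here is the conversion above run on a made-up cookie string (the names and values are hypothetical, for illustration only):

```python
# a made-up cookie string in the same format the browser shows
cookies_str = 'logged_in=yes; _gh_sess=abc123'

cookies_dict = {cookie.split('=')[0]: cookie.split('=')[-1]
                for cookie in cookies_str.split('; ')}

print(cookies_dict)  # {'logged_in': 'yes', '_gh_sess': 'abc123'}
```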

```python
import requests

url = 'https://github.com/USER_NAME'

# construct the request headers dictionary
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}

# construct the cookies dictionary from the cookie string
cookies_str = 'the cookie string copied from the browser'
cookies_dict = {cookie.split('=')[0]: cookie.split('=')[-1] for cookie in cookies_str.split('; ')}

# send the request with the cookies parameter
resp = requests.get(url, headers=headers, cookies=cookies_dict)

print(resp.text)
```

Key point: master the use of the cookies parameter.

**3.5 Converting a cookieJar object into a cookies dictionary**

The response object obtained with requests has a cookies attribute whose value is a cookieJar object, containing the cookies that the other server set locally. How do we convert it into a cookies dictionary?

1. Conversion method: `cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)`
2. `response.cookies` returns an object of type cookieJar.
3. The `requests.utils.dict_from_cookiejar` function returns the cookies dictionary.
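A minimal sketch of the conversion (assuming any URL that sets cookies; baidu.com is used here only as an example):

```python
import requests

response = requests.get('https://www.baidu.com')

# response.cookies is a cookieJar object
print(type(response.cookies))

# convert the cookieJar into a plain dictionary
cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)
print(cookies_dict)
```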