Life is short, I use Python.

Previous portal:

Python crawler (1): The beginning

Python crawler (2): Pre-preparation (1) Basic library installation

Python crawler (3): Pre-preparation (2) Linux basics

Python crawler (4): Pre-preparation (3) Docker basics

Python crawler (5): Pre-preparation (4) Database basics

Python crawler (6): Pre-preparation (5) Crawler framework installation

Python crawler (7): HTTP basics

Python crawler (8): Web page basics

Python crawler (9): Crawler basics

Python crawler (10): Sessions and Cookies

Python crawler (11): Urllib

Python crawler (12): Urllib

Python crawler (13): Urllib

Python crawler (14): Urllib

Python crawler (15): Urllib

Python crawler (16): Urllib

Python crawler (17): Basic usage of Requests

Timeout

We also covered timeouts in urllib, so let’s look at how timeouts in Requests should be written.

import requests

r = requests.get("https://www.geekdigging.com/", timeout=1)
print(r.status_code)

I won't post the specific output here.

Note

The timeout applies only to the connection process and has nothing to do with the download of the response body. That is, timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response within timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds). If no timeout is specified explicitly, requests do not time out.
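Requests also accepts a (connect, read) tuple so the connect and read phases can be limited separately, and any timeout surfaces as requests.exceptions.Timeout. A minimal sketch, with placeholder values:

import requests

try:
    # up to 3.05 s to establish the connection, up to 10 s between bytes read
    r = requests.get("https://www.geekdigging.com/", timeout=(3.05, 10))
    print(r.status_code)
except requests.exceptions.Timeout as e:
    print("request timed out:", e)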

Proxy settings

Proxies work much the same as in urllib, so the code needs no further introduction:

import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

requests.get("https://www.geekdigging.com/", proxies=proxies)

Of course, running this example directly may not work, since these proxies are probably no longer valid; you can find some free proxies to test with on your own.
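If a proxy requires basic authentication, the credentials can be embedded in the proxy URL, and a failing proxy raises requests.exceptions.ProxyError. A sketch with placeholder credentials and addresses:

import requests

proxies = {
    "http": "http://user:password@10.10.1.10:3128",
    "https": "http://user:password@10.10.1.10:1080",
}

try:
    r = requests.get("https://www.geekdigging.com/", proxies=proxies, timeout=5)
    print(r.status_code)
except requests.exceptions.ProxyError as e:
    print("proxy failed:", e)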

In addition to HTTP proxies, Requests also supports SOCKS proxies. This is an optional feature: the dependency is not installed along with Requests itself and has to be installed separately before use.

pip install requests[socks]

With dependencies installed, using a SOCKS proxy is as simple as using an HTTP proxy:

import requests

proxies_socket = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}

requests.get("https://www.geekdigging.com/", proxies=proxies_socket)
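One detail worth knowing: with the socks5:// scheme, hostnames are resolved locally before the request goes through the proxy. Using socks5h:// instead asks the proxy to do the DNS resolution, which matters when the target domain is only resolvable from the proxy's network. The credentials and address below are placeholders, as above:

proxies_socket = {
    'http': 'socks5h://user:pass@host:port',
    'https': 'socks5h://user:pass@host:port'
}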

Cookies

When we handled Cookies with urllib, the code was a bit convoluted; with Requests, getting and setting Cookies each take only one step. Let's start with a simple example of getting them:

import requests

r = requests.get("https://www.csdn.net")
print(type(r.cookies), r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

The results are as follows:

<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[<Cookie dc_session_id=10_1575798031732.659641 for .csdn.net/>, <Cookie uuid_tt_dd=10_19615575150-1575798031732-646184 for .csdn.net/>, <Cookie acw_tc=2760827715757980317314369e26895215355a996a74e112d9936f512dacd1 for www.csdn.net/>]>
dc_session_id=10_1575798031732.659641
uuid_tt_dd=10_19615575150-1575798031732-646184
acw_tc=2760827715757980317314369e26895215355a996a74e112d9936f512dacd1

Here we read the response's cookies attribute to get the Cookies directly. The print output shows that its type is requests.cookies.RequestsCookieJar. We then call its items() method to turn it into a list of name-value tuples and traverse it, printing each Cookie's name and value.
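If a plain dictionary is more convenient than a RequestsCookieJar, the jar can be converted directly with get_dict(), and a single cookie can be read by name with get(). A short sketch:

import requests

r = requests.get("https://www.csdn.net")
cookie_dict = r.cookies.get_dict()
print(cookie_dict)
# get() returns None if the named cookie is absent
print(r.cookies.get('dc_session_id'))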

Maintain session state through Cookies

Since accessing Zhihu requires logging in, we chose Zhihu as the test site. First, let's visit Zhihu directly and look at the returned status code.

import requests

r = requests.get('https://www.zhihu.com')
print(r.status_code)

The results are as follows:

400

Status code 400 indicates a Bad Request.
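As an aside, Requests can turn an error status like this into an exception via raise_for_status(), which is handy when a crawl should stop on failed pages. A minimal sketch:

import requests

r = requests.get('https://www.zhihu.com')
try:
    # raises requests.exceptions.HTTPError for 4xx and 5xx responses
    r.raise_for_status()
except requests.exceptions.HTTPError as e:
    print("request failed:", e)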

Now open a browser, log in to Zhihu, press F12 to open developer tools, and look at the Cookies after logging in.

Copy the cookie string and add it to the request headers:

import requests

headers = {
    'cookie': '_zap=7c875737-af7a-4d55-b265-4e3726f8bd30; _xsrf=MU9NN2kHxdMZBVlENJkgnAarY6lFlPmu; d_c0="ALCiqBcc8Q-PTryJU9ro0XH9RqT4NIEHsMU=|1566658638"; UM_distinctid=16d16b54075bed-05edc85e15710b-5373e62-1fa400-16d16b54076e3d; tst=r; q_c1=1a9d0d0f293f4880806c995d7453718f|1573961075000|1566816770000; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1574492254157954,599157721,552157721,901; tgw_l7_route=f2979fdd289e2265b2f12e4f4a478330; CNZZDATA1272960301=1829573289-1568039631-%7C1575793922; capsion_ticket="2|1:0|10:1575798464|14:capsion_ticket|44:M2FlYTAzMDdkYjIzNDQzZWJhMDcyZGQyZTZiYzA1NmU=|46043c1e4e6d9c381eb18f5dd8e5ca0ddbf6da90cddf10a6845d5d8c589e7754"; z_c0="2|1:0|10:1575798467|4:z_c0|92:Mi4xLXNyV0FnQUFBQUFBc0tLb0Z4enhEeVlBQUFCZ0FsVk53eFRhWGdBSlc3WFo1Vk5RUThBMHMtanZIQ2tYcGFXV2pn|02268679f394bd32662a43630236c2fd97e439151b0132995db7322736857ab6"; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1575798469',
    'host': 'www.zhihu.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}

r = requests.get('https://www.zhihu.com', headers=headers)
print(r.text)

The request now succeeds; the returned HTML is too long to paste here (the original post showed it as a screenshot).

Note that the user-agent and host entries must also be added to the request headers; without them the site cannot be accessed.

Of course, besides pasting the whole string directly like this, Cookies can also be set through the cookies parameter. This requires constructing a RequestsCookieJar object, which takes a few more steps, but the result is the same.

import requests

# the cookie string copied from the browser after logging in
cookies = '_zap=7c875737-af7a-4d55-b265-4e3726f8bd30; _xsrf=MU9NN2kHxdMZBVlENJkgnAarY6lFlPmu; d_c0="ALCiqBcc8Q-PTryJU9ro0XH9RqT4NIEHsMU=|1566658638"; UM_distinctid=16d16b54075bed-05edc85e15710b-5373e62-1fa400-16d16b54076e3d; tst=r; q_c1=1a9d0d0f293f4880806c995d7453718f|1573961075000|1566816770000; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1574492254157954,599157721,552157721,901; tgw_l7_route=f2979fdd289e2265b2f12e4f4a478330; CNZZDATA1272960301=1829573289-1568039631-%7C1575793922; capsion_ticket="2|1:0|10:1575798464|14:capsion_ticket|44:M2FlYTAzMDdkYjIzNDQzZWJhMDcyZGQyZTZiYzA1NmU=|46043c1e4e6d9c381eb18f5dd8e5ca0ddbf6da90cddf10a6845d5d8c589e7754"; z_c0="2|1:0|10:1575798467|4:z_c0|92:Mi4xLXNyV0FnQUFBQUFBc0tLb0Z4enhEeVlBQUFCZ0FsVk53eFRhWGdBSlc3WFo1Vk5RUThBMHMtanZIQ2tYcGFXV2pn|02268679f394bd32662a43630236c2fd97e439151b0132995db7322736857ab6"; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1575798469'

# build the RequestsCookieJar
jar = requests.cookies.RequestsCookieJar()
for cookie in cookies.split('; '):
    key, value = cookie.split('=', 1)
    jar.set(key, value)

headers_request = {
    'host': 'www.zhihu.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}

r = requests.get('https://www.zhihu.com', cookies=jar, headers=headers_request)
print(r.text)

The access again succeeds; I won't paste the result here.

Here we first split the original cookie string on '; ' with split(), then split each piece on the first '=' and use set() to put the key and value into the RequestsCookieJar. Finally we pass the jar as the cookies parameter when accessing Zhihu. Note that the headers parameter is still required; only the cookie entry inside the original headers no longer needs to be set.
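If the cookies are already in a dict, requests.utils.cookiejar_from_dict() builds the jar in one call. A sketch with made-up cookie names:

import requests

cookie_dict = {'number': '123456789', 'name': 'geekdigging'}
jar = requests.utils.cookiejar_from_dict(cookie_dict)
r = requests.get('https://httpbin.org/cookies', cookies=jar)
print(r.text)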

Session maintenance

This next one is the big one, something urllib doesn't offer.

Imagine a scenario: we are crawling data from a web site, where some pages require GET requests and others require POST requests. But if we simply call get() and post() in our program, those are actually two different sessions.

Some readers may say: didn't we just learn that session state is maintained through Cookies? We could add the Cookies to every request.

No problem, that works, but it's a bit of a hassle. Requests provides a more concise and efficient approach: Session.

To demonstrate this, we'll use https://httpbin.org/, which we introduced earlier. Visiting the link https://httpbin.org/cookies/set/number/123456789 sets a Cookie named number with the value 123456789.

Let’s start with an example that uses Requests directly:

import requests

requests.get('https://httpbin.org/cookies/set/number/123456789')
r = requests.get('https://httpbin.org/cookies')
print(r.text)

The results are as follows:

{
  "cookies": {}
}

Here we called the get() method directly twice, and the second call did not carry the Cookies set by the first. Now let's switch to a Session:

import requests

s = requests.Session()
s.get('https://httpbin.org/cookies/set/number/123456789')
r = s.get('https://httpbin.org/cookies')
print(r.text)

The results are as follows:

{
  "cookies": {
    "number": "123456789"
  }
}Copy the code

Obviously, we managed to get the Cookies we set earlier.

So a Session lets us simulate one continuous session without setting Cookies manually. It is used constantly in practice, because it mimics opening different pages of the same site in the same browser; when crawling many pages that sit behind a login, it greatly simplifies our code.
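A Session can also carry default headers for every request it makes, via s.headers.update(). Below is a sketch of the common login-then-crawl pattern; the URLs and form fields are hypothetical placeholders, not a real site's API:

import requests

s = requests.Session()
# headers set here are sent on every request made through this session
s.headers.update({'user-agent': 'Mozilla/5.0'})

# hypothetical login endpoint and form fields
s.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})

# the session now carries the login cookies automatically
r = s.get('https://example.com/profile')
print(r.status_code)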

Sample code

All of the code in this series will be available on Github and Gitee.

Example code -Github

Example code -Gitee