This is the 11th original article from Learning Python Every Day.


With the urllib library, even getting a cookie is a chore; it takes several steps, which does not feel very Pythonic. The answer is requests, a third-party library created by Kenneth Reitz to make it easier for Python developers to send and handle HTTP requests. It also has a tagline: HTTP for Humans. As the name suggests, it is used for making HTTP requests. If you want to read the source code, search for his name on GitHub.


Here is how to use this library.


(This article was originally published on the Learning Python public account.)


Since this is a third-party library, we need to install it first. On the command line, type:

pip install requests


You can skip this step if you use Anaconda, which already ships with requests.


Once it is installed, let's use it.


Perform simple operations

Send a GET request

# send request

import requests


response = requests.get('http://httpbin.org/get')

# get the returned HTML content

print(response.text)


This sends a GET request and prints the returned content. You usually no longer need to work out the page encoding yourself, although encoding problems do still come up occasionally; in that case you can specify the encoding explicitly, for example:

response.encoding = 'utf-8'


Once the encoding is specified, the text decodes correctly, provided you know the actual encoding of the page.
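If you do not know a page's real encoding, requests can also guess it from the response body. A minimal sketch using the library's apparent_encoding attribute:

import requests

response = requests.get('http://httpbin.org/get')
print(response.encoding)           # encoding taken from the response headers (may be None or wrong)
print(response.apparent_encoding)  # encoding guessed from the response body
response.encoding = response.apparent_encoding  # use the guess before reading response.text
print(response.text)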


Beyond the body, we can also get the following information:

# response headers
print(response.headers)

# request status code
print(response.status_code)

# binary content of the page
print(response.content)

# the requested URL
print(response.url)

# get the cookies
print(response.cookies)


Pretty easy, right? Just one line of code each, with none of the multi-step boilerplate.


Next, let's attach a request header when making a request.

# you can also add headers to a request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}

response = requests.get('http://httpbin.org/get', headers=headers)

print(response.headers)

print(response.text)


Adding a request header is just a matter of passing one more keyword argument.


You can also make GET requests with query parameters:

# get request with parameters

data = {'name': 'june', 'password': 123456}

response = requests.get('http://httpbin.org/get', params=data)

print(response.text)




What if you need to log in? How do you send a POST request? That is just as easy:

# post request

data = {'name': 'june', 'password': 123456}

response = requests.post('http://httpbin.org/post', data=data, headers=headers)

print(response.text)


Just add a data keyword argument, and the login parameters are submitted as the body of the POST request.
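If the server expects a JSON body rather than form data, requests also accepts a json keyword argument and serializes the dictionary for you; a minimal sketch against httpbin:

import requests

# the json keyword serializes the dict and sets the Content-Type header to application/json
payload = {'name': 'june', 'password': 123456}
response = requests.post('http://httpbin.org/post', json=payload)
print(response.text)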


Can you send more kinds of requests than the two above? Yes. For example, to send PUT and DELETE requests:

requests.put('http://httpbin.org/put', data=data)

requests.delete('http://httpbin.org/delete')


That is how PUT and DELETE requests are sent; other HTTP methods work the same way, so I won't go into detail.
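For completeness, a short runnable sketch of both, again against httpbin:

import requests

data = {'name': 'june', 'password': 123456}

# PUT request; httpbin echoes the submitted form data back
response = requests.put('http://httpbin.org/put', data=data)
print(response.text)

# DELETE request
response = requests.delete('http://httpbin.org/delete')
print(response.status_code)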



Make more complex requests


When logging in, we sometimes need to enter a captcha. A crawler cannot see the web page, so how do we enter it? The simplest approach is to download the captcha image and type it in manually. And how do we download it? We send a request to the image's URL and write the returned content to a file in binary mode.


The code is as follows:

# read binary data from the web, such as images

response = requests.get('https://www.baidu.com/img/bd_logo1.png', headers=headers)

# this is the raw bytes, which is what we write to the file
print(response.content)

# this is the decoded text, which comes out garbled for an image
print(response.text)

# save the image to a file

with open('baidu.png', 'wb') as f:  # note that the write mode is binary
    f.write(response.content)

print('Download complete')



It’s still pretty simple, and I have to say, it’s pretty darn good.


When we need to upload a file, such as an image, we can also send it with the post method:

# upload file

files = {'picture': open('baidu.png', 'rb')}

response = requests.post('http://httpbin.org/post', files=files)

print(response.text)



Getting cookies and working with them is just as simple:

# get cookie

response = requests.get('https://www.baidu.com')

for k, v in response.cookies.items():
    print(k, '=', v)



When a page returns JSON, we don't need the json library to parse it; the response object can parse it directly, with the same result:

# parse json

j = response.json()  # same result as parsing response.text with the json library
print(j)




With the urllib library, you have to save cookies yourself to keep login state, but with requests you only need requests.Session() to hold that information.

# use a session to keep login state

session = requests.session()

response = session.get('http://httpbin.org/cookies/set/number/123456')

print(response.text)



This keeps the login state, so you don't have to manage cookies yourself: create one session and reuse it for all subsequent requests and operations. There is no need to create a new session for every request; doing so would lose the login information.
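A minimal sketch using httpbin's cookie endpoints, showing that a cookie set in one request is sent back automatically on the next request from the same session:

import requests

session = requests.Session()

# the first request sets a cookie on the session
session.get('http://httpbin.org/cookies/set/number/123456')

# the second request from the same session carries that cookie automatically
response = session.get('http://httpbin.org/cookies')
print(response.text)  # should show the number cookie set above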



Sometimes a site's HTTPS certificate cannot be verified, for example this one:

https://www.12306.cn



To access the content of such a site, we have to deal with the verification; the code is as follows:

# certificate verification

# without verify=False the request raises an SSL error, because the site's certificate cannot be verified
response = requests.get('https://www.12306.cn', verify=False)


This gains access, but prints a warning:

E:\anaconda\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

InsecureRequestWarning)

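If the warning bothers you, it can be silenced explicitly. A sketch that relies on urllib3, the library requests uses underneath:

import urllib3
import requests

# silence the InsecureRequestWarning raised by unverified HTTPS requests
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)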

We can also pass a cert keyword argument to the request. Its value is a client certificate: either the path to a single certificate file, or a tuple of the certificate and key files. I won't demonstrate it with a real site here.
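For reference, a sketch of what the call looks like; the certificate and key paths here are hypothetical placeholders, not real files:

# the paths below are placeholders for a real client certificate and key
response = requests.get('https://example.com',
                        cert=('/path/to/client.crt', '/path/to/client.key'))
print(response.status_code)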


We can handle websites that require authentication in a similar way:

from requests.auth import HTTPBasicAuth

# set authentication


# replace the URL below with one that actually requires authentication
response = requests.get('http://example.com', auth=HTTPBasicAuth('user', 'passwd'))

# you can also simply pass a tuple
response = requests.get('http://example.com', auth=('user', 'passwd'))


Since I couldn't find a suitable site that requires authentication, I won't demo this one.


requests can also send requests through proxy IPs, which helps keep your own IP from being blocked and the crawl from being cut off. Using proxies is also much simpler than with the urllib library; the code is as follows:

# set proxy

proxies = {'http': 'http://122.114.31.177:808',
           'https': 'https://119.28.223.103:8088'}

# use the above proxies when sending the request

response = requests.get('http://httpbin.org/get', proxies=proxies)

print(response.text)


The proxies must be in the dictionary format shown above, keyed by scheme, and are then passed to the request as one more keyword argument.
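If a proxy requires a username and password, they can be put in the proxy URL itself. A sketch with placeholder credentials and a placeholder proxy address:

# the credentials and proxy address below are placeholders, for illustration only
proxies = {'http': 'http://user:password@10.10.1.10:3128',
           'https': 'http://user:password@10.10.1.10:3128'}

response = requests.get('http://httpbin.org/get', proxies=proxies)
print(response.text)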


Request Exception Handling


When a program hits an error, it is forced to stop. If you want it to keep running, you need to catch the exception so execution can continue.


The requests library has a module for exactly this: requests.exceptions.


Let’s briefly deal with the request timeout case

import requests

from requests.exceptions import ReadTimeout, ConnectTimeout, HTTPError, ConnectionError, RequestException

# catch exception

try:
    # raise an exception if no response is received within the specified time
    response = requests.get('http://httpbin.org/get', timeout=0.1)
    print(response.text)
except ReadTimeout as e:
    print('Request timed out')
except ConnectionError as e:
    print('Connection failed')
except RequestException as e:
    print('Request failed')



Three exceptions are caught here. ReadTimeout and ConnectionError are both subclasses of RequestException, so the more specific exceptions are listed first and the parent class RequestException comes last; if the parent class were caught first, it would swallow the more specific ones.


See the documentation for more exception handling.
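For instance, the HTTPError imported above is raised by response.raise_for_status() when the server returns an error status code; a minimal sketch:

import requests
from requests.exceptions import HTTPError

try:
    response = requests.get('http://httpbin.org/status/404')
    response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
except HTTPError as e:
    print('HTTP error:', e)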


Finally

The above are my notes from learning the library, plus a few pitfalls from my own use, simply written down; I hope they are useful to you. If you want to see more usage, go to the official documentation. I have also put the code on GitHub, so you can check it out if you want.


GitHub:github.com/SergioJune/…

Official documentation: docs.python-requests.org/zh_CN/lates…

Learning resources: https://edu.hellobi.com/course/157








Daily learning python

A public account dedicated to Python