Crawlers can be written with urllib plus re (regular expressions), but today I studied the Requests library, which makes sending requests much more convenient than urllib. These notes are based on the course "Python Web Crawler and Information Extraction" taught by Song Tian of Beijing Institute of Technology on the Chinese University MOOC platform.

Official website of the Requests library

Several main methods of the Requests library

| Method | Description |
| --- | --- |
| requests.request() | Constructs a request; the base method that supports all of the methods below |
| requests.get() | The main method for fetching an HTML page; corresponds to HTTP GET |
| requests.head() | Gets the header information of an HTML page; corresponds to HTTP HEAD |
| requests.post() | Submits a POST request to an HTML page; corresponds to HTTP POST |
| requests.put() | Submits a PUT request to an HTML page; corresponds to HTTP PUT |
| requests.patch() | Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH |
| requests.delete() | Submits a DELETE request to an HTML page; corresponds to HTTP DELETE |

These HTTP methods are described below

| Method | Description |
| --- | --- |
| GET | Request the resource at the URL location |
| HEAD | Request a response report for the resource at the URL location, i.e. get that resource's header information |
| POST | Request that new data be appended to the resource at the URL location |
| PUT | Request that a resource be stored at the URL location, overwriting the resource that was there |
| PATCH | Request a partial update of the resource at the URL location, i.e. change part of that resource's content |
| DELETE | Request deletion of the resource stored at the URL location |

The get() method

requests.get(url, params=None, **kwargs) takes the URL, optional parameters to send with the request, and other keyword arguments that control access, and returns a Response object. Here are some properties of the Response object.

| Attribute | Description |
| --- | --- |
| r.status_code | Status code of the HTTP request; 200 means the connection succeeded, 404 means it failed |
| r.text | The HTTP response body as a string, i.e. the page content of the URL |
| r.encoding | The encoding of the response content guessed from the HTTP headers |
| r.apparent_encoding | The encoding of the response content inferred from the content itself (an alternative encoding) |
| r.content | The HTTP response body in binary form |

import requests
r = requests.get("http://www.baidu.com")
type(r)

requests.models.Response

r.status_code

200

r.text

As you can see from the r.text property, the page content comes back, but the Chinese characters are not displayed correctly.

r.encoding is ISO-8859-1, because the value of r.encoding comes from the charset field of the response headers; if charset is not present in the headers, the encoding is assumed to be ISO-8859-1. See r.headers:

r.headers

{'Content-Type': 'text/html', 'Content-Encoding': 'gzip'}

r.text renders the page content according to r.encoding, which is why the garbled characters appear. Now look at r.apparent_encoding, which guesses the encoding from the content of the response itself.

r.encoding = 'utf-8'
r.text

Now the Chinese characters display correctly.
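Rather than hard-coding 'utf-8', the encoding guessed from the content itself can be used directly; a minimal sketch of that approach (the slice length of 200 is just an arbitrary check):

```python
import requests

r = requests.get("http://www.baidu.com")
# Use the encoding inferred from the page content so the Chinese text renders correctly
r.encoding = r.apparent_encoding
print(r.text[:200])  # print the first 200 characters as a quick check
```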

The head() method

>>> r = requests.head('http://httpbin.org/get')
>>> r.headers
{'Content-Length': '238', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Content-Type': 'application/json', 'Server': 'nginx', 'Connection': 'keep-alive', 'Date': 'Sat, 18 Feb 2017 12:07:44 GMT'}
>>> r.text
''

The post() method

The data argument carries the data submitted by the POST method.

POST a dictionary to a URL and it is automatically encoded as a form:

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post('http://httpbin.org/post', data=payload)
>>> print(r.text)
{ ...
  "form": {
    "key2": "value2",
    "key1": "value1"
  },
  ...
}

POST a string to a URL and it is automatically encoded as data:

>>> r = requests.post('http://httpbin.org/post', data='ABC')
>>> print(r.text)
{ ...
  "data": "ABC",
  "form": {},
  ...
}

The other methods are similar

The requests.request() method

requests.request(method, url, **kwargs)

  • method: the request method, corresponding to the seven HTTP methods (GET, PUT, POST, etc.)
  • url: the URL of the page to fetch
  • **kwargs: 13 optional parameters that control access

The dedicated request methods are all implemented by calling requests.request() with the corresponding HTTP method:

r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)

The **kwargs parameters that control access are as follows:

| Parameter | Description |
| --- | --- |
| params | Dictionary or byte sequence, added to the URL as query parameters |
| data | Dictionary, byte sequence, or file object, used as the body of the Request |
| json | Data in JSON format, used as the body of the Request |
| headers | Dictionary, custom HTTP headers |
| cookies | Dictionary or CookieJar, cookies to send with the Request |
| auth | Tuple, supports HTTP authentication |
| files | Dictionary, for transferring files |
| timeout | Timeout in seconds |
| proxies | Dictionary, sets proxy servers for access; login authentication can be included |
| allow_redirects | True/False, default True; switch for following redirects |
| stream | True/False; switch for whether the response body is downloaded immediately (in the Requests library itself the default is False, i.e. download immediately) |
| verify | True/False, default True; switch for verifying the SSL certificate |
| cert | Path to a local SSL client certificate |

Params example

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('GET', 'http://python123.io/ws', params=kv)
>>> print(r.url)
http://python123.io/ws?key1=value1&key2=value2

Data example

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('POST', 'http://python123.io/ws', data=kv)
>>> body = 'Body content'
>>> r = requests.request('POST', 'http://python123.io/ws', data=body)

Json example

>>> kv = {'key1': 'value1'}
>>> r = requests.request('POST', 'http://python123.io/ws', json=kv)

Headers example

>>> hd = {'user-agent': 'Chrome/10'}
>>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)

Files example

>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)

Timeout example

>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)

Proxies example

>>> pxs = {'http': 'http://user:pass@10.10.10.1:1234',
...        'https': 'https://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
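The remaining control parameters in the table are passed the same way. For example, auth takes a (user, password) tuple for HTTP basic authentication and verify switches SSL certificate verification on or off; the URLs and credentials below are only placeholders:

```python
import requests

# Placeholder URL and credentials -- replace with real values
r = requests.request('GET', 'http://python123.io/ws', auth=('user', 'pass'))  # HTTP basic authentication
r = requests.request('GET', 'https://python123.io/ws', verify=False)          # skip SSL certificate verification
```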

Exception handling

| Exception | Description |
| --- | --- |
| requests.ConnectionError | Network connection error, such as a DNS lookup failure or a refused connection |
| requests.HTTPError | HTTP error |
| requests.URLRequired | The URL is missing |
| requests.TooManyRedirects | The maximum number of redirects was exceeded |
| requests.ConnectTimeout | Connecting to the remote server timed out |
| requests.Timeout | The request to the URL timed out |

ConnectTimeout refers to a timeout while connecting to the server, whereas Timeout covers the entire request (including the time needed before and after connecting to the server).
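Beyond what the course covers, the timeout parameter of Requests also accepts a (connect, read) tuple, which makes this distinction concrete; a small sketch with placeholder values:

```python
import requests

try:
    # allow 3 seconds to establish the connection and 10 seconds to read the response
    r = requests.get('http://www.baidu.com', timeout=(3, 10))
except requests.ConnectTimeout:
    print('timed out while connecting to the server')
except requests.Timeout:
    print('the request timed out')
```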

The r.raise_for_status() method

If a status code other than 200 was returned, it raises a requests.HTTPError exception; this is used in the code framework below.

A code framework for crawling web pages
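A minimal sketch of such a framework, combining a timeout, r.raise_for_status(), and r.apparent_encoding as described above (the function name get_html_text and the test URL are just illustrative choices):

```python
import requests

def get_html_text(url):
    """Fetch a web page and return its text, or an error message on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()               # raise requests.HTTPError on an unsuccessful status code
        r.encoding = r.apparent_encoding   # decode with the encoding inferred from the content
        return r.text
    except requests.RequestException:
        return "request failed"

if __name__ == "__main__":
    print(get_html_text("http://www.baidu.com")[:500])
```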

A case: crawling an image from a web page and saving it
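A minimal sketch of such a case, reading the binary image data from r.content and writing it to a local file (the image URL and save directory are placeholders to be replaced with real values):

```python
import os
import requests

url = "http://example.com/picture.jpg"   # placeholder image URL
root = "./pics/"                         # placeholder save directory
path = root + url.split('/')[-1]         # name the local file after the last segment of the URL

try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()
        with open(path, 'wb') as f:
            f.write(r.content)           # r.content is the binary form of the response
        print("file saved successfully")
    else:
        print("file already exists")
except requests.RequestException:
    print("failed to crawl the image")
```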