Preface

For more, please visit my personal blog.


There are many HTTP libraries in the Python ecosystem: urllib, urllib2, urllib3, httplib, httplib2, requests. This article covers two of them:

  • The built-in urllib module
    • Advantages: built into Python, no third-party library to download
    • Disadvantages: cumbersome to use, lacks advanced features
  • The third-party requests library
    • Advantages: handling URL resources is particularly convenient
    • Disadvantages: needs to be downloaded and installed first

The built-in urllib module

Making a GET request

The urlopen() method is used to initiate the request, as follows:

from urllib import request

resp = request.urlopen('http://www.baidu.com')
print(resp.read().decode())

The result is an http.client.HTTPResponse object whose read() method retrieves the data from the web page. Note, however, that the returned data is in binary bytes format, so decode() is needed to convert it to a string.
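Besides read(), the response object also exposes the status code and response headers; a minimal sketch, again requesting Baidu:

from urllib import request

resp = request.urlopen('http://www.baidu.com')
print(resp.status)                 # HTTP status code, e.g. 200
print(resp.getheaders())           # List of (name, value) response header tuples
print(resp.getheader('Server'))    # Value of a single response header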

Making a POST request

urlopen() sends a GET request by default; when the data argument is passed to it, a POST request is made instead. Note: the data passed must be in bytes format.

Setting the timeout parameter sets a timeout period; if the request takes longer than that, an exception is raised. As follows:

from urllib import request

resp = request.urlopen('http://www.baidu.com', data=b'word=hello', timeout=10)
print(resp.read().decode())
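In practice the POST body is usually built from a dict; a minimal sketch of encoding it into the required bytes with urllib.parse.urlencode (httpbin.org and the field names are used here just for illustration):

from urllib import parse, request

# urlencode() builds 'word=hello&lang=python', encode() turns it into bytes
data = parse.urlencode({'word': 'hello', 'lang': 'python'}).encode()
resp = request.urlopen('http://httpbin.org/post', data=data, timeout=10)
print(resp.read().decode())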

Add Headers

By default, urllib sends requests with the header "User-Agent: Python-urllib/3.6", which tells the server that the request came from urllib. For sites that validate the User-Agent, we therefore need to customize the headers, which requires the Request object in urllib.request.

from urllib import request

url = 'http://httpbin.org/get'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}

# Generate a Request object from the url and headers, then pass it to urlopen
req = request.Request(url, headers=headers)
resp = request.urlopen(req)
print(resp.read().decode())

The Request object

As shown above, the urlopen() method can pass not only a url in string form, but also a Request object to extend the functionality, which looks like this:

class urllib.request.Request(url, data=None, headers={},
                                origin_req_host=None,
                                unverifiable=False, 
                                method=None)

To construct a Request object, the url parameter is required; data and headers are optional.

Finally, Request accepts a method argument to select the HTTP method, such as PUT, DELETE, and so on. The default is GET.
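For example, a minimal sketch of issuing a PUT request by specifying method (httpbin.org is used purely for illustration):

from urllib import request

req = request.Request('http://httpbin.org/put', data=b'word=hello', method='PUT')
resp = request.urlopen(req)
print(resp.read().decode())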

Add a Cookie

To carry cookie information with a request, we need to construct an opener.

We use the request.build_opener method to construct the opener, attach the cookie handler we want to it, and then use the opener's open method to initiate the request. As follows:

from http import cookiejar
from urllib import request

url = 'https://www.baidu.com'
# Create a CookieJar object
cookie = cookiejar.CookieJar()
# Create a cookie handler with HTTPCookieProcessor
cookies = request.HTTPCookieProcessor(cookie)
# Create the opener object with the handler as an argument
opener = request.build_opener(cookies)
# Use this opener to initiate a request
resp = opener.open(url)

# View the previous cookie object, you can see the cookie obtained by accessing Baidu
for i in cookie:
    print(i)

Alternatively, the generated opener can be set as the global opener using the install_opener method.

Cookies will then always be attached to subsequent requests made with the urlopen method.

# Set this opener as the global opener
request.install_opener(opener)
resp = request.urlopen(url)

Setting a Proxy

When crawling data, we often need to use proxies to hide our real IP address. As follows:

from urllib import request

url = 'http://www.baidu.com'
proxy = {'http': '222.222.222.222:80', 'https': '222.222.222.222:80'}
# Create a proxy handler
proxies = request.ProxyHandler(proxy)
# Create the opener object
opener = request.build_opener(proxies)

resp = opener.open(url)
print(resp.read().decode())

Download data locally

We often need to save data such as images or audio locally when making network requests. One way to do this is to use Python's file operations to write the data returned by read() to a file.
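A minimal sketch of that approach (the file name is arbitrary):

from urllib import request

resp = request.urlopen('http://www.baidu.com')
# read() returns bytes, so open the file in binary write mode
with open('baidu.html', 'wb') as f:
    f.write(resp.read())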

urllib also provides an urlretrieve() method that saves the requested data directly into a file. As follows:

from urllib import request

url = 'http://python.org/'
request.urlretrieve(url, 'python.html')

The second argument to urlretrieve() is the path, including the file name, where the data is saved.

Note: the urlretrieve() method was ported directly from Python 2 and may be deprecated in a future version.

The third-party requests library

Installation

Since Requests is a third-party library, install it first, as follows:

pip install requests

Making a GET request

Use the get method directly as follows:

import requests

r = requests.get('http://www.baidu.com/')
print(r.status_code)    # Status code
print(r.text)   # Response body

For URLs with query parameters, pass a dict as the params argument, as follows:

import requests

r = requests.get('http://www.baidu.com/', params={'q': 'python', 'cat': '1001'})
print(r.url)    # The actual URL that was requested
print(r.text)

An additional convenience of requests is that certain response types, such as JSON, can be parsed directly, as follows:

r = requests.get('https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%20%3D%202151330&format=json')
r.json()

# {'query': {'count': 1, 'created': '2017-11-17T07:14:12Z', ...

Add Headers

When we need to pass HTTP headers, we pass a dict as the headers argument, as follows:

r = requests.get('https://www.baidu.com/', headers={'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'})

Get the response header as follows:

r.headers
# {'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Content-Encoding': 'gzip', ... }

r.headers['Content-Type']
# 'text/html; charset=utf-8'

Making a POST request

To send a POST request, simply change get() to post() and pass the data argument as the body of the POST request, as follows:

r = requests.post('https://accounts.baidu.com/login', data={'form_email': '[email protected]', 'form_password': '123456'})

requests POSTs data as application/x-www-form-urlencoded by default. If you want to send JSON data, pass the json argument instead, as follows:

params = {'key': 'value'}
r = requests.post(url, json=params) # Internal automatic serialization to JSON

Upload a file

Uploading a file requires a more complex encoding format, but requests simplifies it to the files argument, as follows:

upload_files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=upload_files)

When opening the file, be sure to use 'rb' (binary mode) so that the bytes read match the full length of the file.

Replace the post() method with put(), delete(), and so on to request resources with PUT or DELETE.
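For example, a rough sketch (httpbin.org is used here purely for illustration):

r = requests.put('http://httpbin.org/put', data={'key': 'value'})
r = requests.delete('http://httpbin.org/delete')
print(r.status_code)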

Add a Cookie

To pass cookies in a request, simply prepare a dict and pass it as the cookies argument, as follows:

cs = {'token': '12345', 'status': 'working'}
r = requests.get(url, cookies=cs)

requests handles response cookies specially, so we can retrieve a specific cookie easily without parsing anything ourselves, as follows:

r.cookies['token']
# 12345

Specify the timeout

To specify a timeout, pass in the timeout argument in seconds. Timeout is divided into connection timeout and read timeout, as follows:

try:
    # Connection timeout of 3.1 seconds, read timeout of 27 seconds
    r = requests.get(url, timeout=(3.1, 27))
except requests.exceptions.RequestException as e:
    print(e)

Timeout reconnection

def gethtml(url):
    # Retry the request up to 3 times before giving up
    i = 0
    while i < 3:
        try:
            html = requests.get(url, timeout=5).text
            return html
        except requests.exceptions.RequestException:
            i += 1
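A quick usage note: if all three attempts fail, the function falls through the loop and returns None, so callers may want to check for that:

html = gethtml('http://www.baidu.com')
if html is None:
    print('request failed after 3 retries')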

Add a proxy

Just like adding headers, the proxies argument is also a dict, as follows:

heads = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'
}
proxy = {
    'http': 'http://120.25.253.234:812', 'https': 'https://163.125.222.244:8123'
}
r = requests.get('https://www.baidu.com/', headers=heads, proxies=proxy)

For more programming tutorials, follow the official account: Pan Gao accompanies you to learn programming.