Using urllib's request module, we can easily send requests and receive responses. This section looks at its usage in detail.

1. urlopen()

The urllib.request module provides the most basic way of constructing HTTP requests. With it, we can simulate the process of a browser initiating a request, and it also handles authorization authentication, redirections, browser Cookies, and more.

Let’s see how powerful it is. Here we take the Python official website as an example:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

Figure 3-1 shows the running result.

Figure 3-1 Running results

Here, with only two lines of code, we crawl the Python official website and output the source code of the web page. Once we have the source code, can't we then extract the links, image addresses, and text we want from it?

Next, let’s see what it actually returns. Output the type of the response using the type() method:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

The following output is displayed:

<class 'http.client.HTTPResponse'>

As you can see, it is an object of type HTTPResponse. It mainly contains methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), and attributes such as msg, version, status, reason, debuglevel, and closed.

Once we have this object and assign it to the response variable, we can call these methods and read these attributes to get all kinds of information about the result.

For example, calling the read() method returns the content of the page, and reading the status attribute gives the status code of the result, such as 200 for a successful request, 404 for page not found, and so on.
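The remaining attributes listed above can be inspected in the same way. A minimal sketch (the exact values depend on the server and the connection):

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.version)    # e.g. 11, meaning HTTP/1.1
print(response.reason)     # e.g. 'OK'
print(response.msg)        # the response headers as an http.client.HTTPMessage
print(response.closed)     # False while the response has not been closed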

Here’s another example:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

The running results are as follows:

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Content-Length', '47397'), ('Accept-Ranges', 'bytes'), ('Date', 'Mon, 01 Aug 2016 09:57:31 GMT'), ('Via', '1.1 varnish'), ('Age', '2473'), ('Connection', 'close'), ('X-Served-By', 'cache-lcy1125-LCY'), ('X-Cache', 'HIT'), ('X-Cache-Hits', '23'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

The last output gets the Server value in the response headers by calling getheader() with the argument Server. The result is nginx, which means the server is built with Nginx.

With the most basic urlopen() method, you can already complete a simple GET request for a web page.

If you want to pass some parameters to a link, how do you do that? Let’s first look at the urlopen() API:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

As you can see, in addition to passing the URL as the first parameter, we can also pass other things, such as data, timeout, and so on.

Let’s explain the usage of these parameters in detail.

The data parameter

The data argument is optional. If you want to add it, it must be of the bytes type (a byte stream); if it is not, you need to convert it with the bytes() method. In addition, if this parameter is passed, the request becomes a POST instead of a GET.

Here’s an example:

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

Here we pass one parameter, word, whose value is hello. It needs to be transcoded to bytes. The bytes() method takes a str (string) as its first argument, so we use the urlencode() method in the urllib.parse module to convert the parameter dictionary to a string; the second argument specifies the encoding format, which here is utf8.
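To make the two conversion steps explicit, here is a minimal sketch showing the intermediate values (not part of the original example):

import urllib.parse

params = urllib.parse.urlencode({'word': 'hello'})  # -> 'word=hello', a str
data = bytes(params, encoding='utf8')               # -> b'word=hello', a bytes object
print(type(params), params)
print(type(data), data)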

The requested site is httpbin.org, which provides HTTP request testing. The URL we request this time is httpbin.org/post; this endpoint can be used to test POST requests, and it outputs information about the request, including the data parameter we passed.

The running results are as follows:

{
     "args": {},
     "data": "",
     "files": {},
     "form": {
         "word": "hello"
     },
     "headers": {
         "Accept-Encoding": "identity",
         "Content-Length": "10",
         "Content-Type": "application/x-www-form-urlencoded",
         "Host": "httpbin.org",
         "User-Agent": "Python-urllib/3.5"
     },
     "json": null,
     "origin": "123.124.23.253",
     "url": "http://httpbin.org/post"
}

The parameters we passed appear in the form field, indicating that the data was transferred as a POST request, mimicking a form submission.
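By contrast, if the data parameter is omitted, the request stays a GET, and the same parameters would have to go into the URL's query string. A minimal sketch of this alternative, using the httpbin.org/get test endpoint:

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({'word': 'hello'})
# Without data, urlopen() sends a GET; the parameters ride in the query string
response = urllib.request.urlopen('http://httpbin.org/get?' + params)
print(response.read().decode('utf-8'))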

The timeout parameter

The timeout parameter is used to set the timeout period, in seconds, meaning that if a request exceeds this specified time and no response is received, an exception will be thrown. If this parameter is not specified, the global default time is used. It supports HTTP, HTTPS, and FTP requests.

Here’s an example:

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

The running results are as follows:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/py/python/urllibtest.py", line 4, in <module>
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
...
urllib.error.URLError: <urlopen error timed out>

Here we set the timeout to 1 second. If the server has still not responded after 1 second, a URLError is raised. This exception belongs to the urllib.error module, and its cause is a timeout.

Therefore, you can use this timeout to skip fetching a page if it does not respond for a long time, with a try...except statement as follows:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

Here we request the httpbin.org/get test link with the timeout set to 0.1 seconds, catch the URLError, and check whether the exception's reason is of type socket.timeout; if so, it really is a timeout error, and we print TIME OUT.

The running results are as follows:

TIME OUT

Generally speaking, it is almost impossible to get a response from the server within 0.1 seconds, so the TIME OUT prompt is printed.

It is sometimes useful to set timeout to handle timeouts.

The other parameters

In addition to the data and timeout parameters, there is also the context parameter, which must be of type ssl.SSLContext and is used to specify SSL settings.

In addition, the cafile and capath parameters specify the CA certificate and its path, respectively, which are useful when requesting HTTPS links.

The cadefault parameter is now deprecated and defaults to False.
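As an illustration of the context parameter, here is a minimal sketch that builds an ssl.SSLContext and passes it to urlopen(). The settings below deliberately disable certificate verification, purely as a testing assumption; for normal use, the default context already verifies certificates:

import ssl
import urllib.request

# Build an SSL context; by default it verifies server certificates
context = ssl.create_default_context()
# Disable verification (for testing only; do not do this in production)
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

response = urllib.request.urlopen('https://www.python.org', context=context)
print(response.status)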

The urlopen() method is a very basic way to complete simple requests and web fetching. If you need more detailed information, see the official documentation: docs.python.org/3/library/u… .

2. Request

We know that the urlopen() method can initiate the most basic requests, but its few simple parameters are not enough to build a complete request. If the request needs headers and other information, it can be built with the more powerful Request class.

First, let’s get a feel for the use of Request with an example:

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

As you can see, we still use the urlopen() method to send the request, but this time the method's argument is no longer a URL but an object of type Request. By constructing this data structure, on the one hand we can treat the request as a separate object, and on the other hand we can configure its parameters more richly and flexibly.

Let’s take a look at how a Request can be constructed. It can be constructed as follows:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

  • The first parameter, url, is the URL to request. It is mandatory; the other parameters are optional.
  • The second parameter, data, must be of the bytes (byte stream) type if it is passed. If it is a dictionary, you can first encode it with urlencode() from the urllib.parse module.
  • The third parameter, headers, is a dictionary of request headers. It can be set directly via the headers parameter when constructing the request, or added by calling the request instance's add_header() method. The most common use of request headers is to modify the User-Agent to disguise the request as a browser. By default the User-Agent is Python-urllib; we can change it to masquerade as a browser. For example, to disguise the request as Firefox, it can be set to:

Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11

  • The fourth parameter, origin_req_host, is the host name or IP address of the requester.
  • The fifth parameter, unverifiable, indicates whether the request cannot be verified. The default is False, meaning that the user does not have sufficient permission to choose to receive the result of this request. For example, if we request an image embedded in an HTML document but do not have permission to fetch images automatically, the value of unverifiable is True.
  • The sixth parameter, method, is a string indicating the method used by the request, such as GET, POST, or PUT.

Let’s pass in multiple parameters to build the request:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Here we construct a request with four parameters: url is the request URL, headers specifies the User-Agent and Host, and data is converted to a byte stream with the urlencode() and bytes() methods. In addition, the request method is specified as POST.

The running results are as follows:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
  },
  "json": null,
  "origin": "219.224.169.11",
  "url": "http://httpbin.org/post"
}

You can see that data, headers, and method have all been set successfully.

Alternatively, headers can be added using the add_header() method:

req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

This makes it much easier to construct and send requests.

3. Advanced usage

In the above process, we can construct requests, but what about more advanced operations (such as Cookies handling, proxy settings, and so on)?

Next, a more powerful tool called Handler comes into play. In short, we can think of Handlers as a variety of processors: some deal with login authentication, some deal with Cookies, and some deal with proxy settings. With them, we can do almost anything in an HTTP request.

First, the BaseHandler class in the urllib.request module is the parent of all other handlers and provides the most basic methods, such as default_open(), protocol_request(), etc.

Next, there are various Handler subclasses that inherit from this BaseHandler class; some examples are listed below.

  • HTTPDefaultErrorHandler: used to handle HTTP response errors; errors are raised as exceptions of type HTTPError.
  • HTTPRedirectHandler: Used to handle redirects.
  • HTTPCookieProcessor: Used to handle Cookies.
  • ProxyHandler: Used to set the proxy. The default proxy is empty.
  • HTTPPasswordMgr: Used to manage passwords. It maintains a table of user names and passwords.
  • HTTPBasicAuthHandler: Used to manage authentication and resolve authentication issues if a link is opened that requires authentication.

In addition, there are other Handler classes that are not listed one by one here; for details, you can refer to the official documentation: docs.python.org/3/library/u… .

There is no need to worry about how to use them yet; there will be examples later.

Another important class is OpenerDirector, which we can call Opener for short. We have used the urlopen() method before; it is actually an Opener provided by urllib.

So why introduce Opener? Because we need to implement more advanced functionality. Request and urlopen() are encapsulations of very common request methods that complete basic requests, but to implement more advanced functionality we need to go one level deeper and configure things ourselves, using lower-level instances. This is where Opener comes in.

An Opener has an open() method that returns the same type as urlopen(). So what does it have to do with Handler? In short, we use Handlers to build Openers.
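A minimal sketch of that relationship: build_opener() with no extra Handlers produces an Opener that behaves like the default one behind urlopen(), and install_opener() can make it the global default:

import urllib.request

opener = urllib.request.build_opener()            # an OpenerDirector instance
response = opener.open('https://www.python.org')
print(type(response))                             # same HTTPResponse type as urlopen() returns

# Optionally install it globally so that urlopen() uses this Opener from now on
urllib.request.install_opener(opener)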

Here are a few examples to see how they are used.

Validation

When some websites are opened, a dialog box is displayed asking you to enter your user name and password. You can view the web page only after the login is successful, as shown in Figure 3-2.

Figure 3-2 Verification page

So what if you want to request such a page? This can be done with the help of the HTTPBasicAuthHandler:

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

Here we first instantiate an HTTPBasicAuthHandler object, whose parameter is an HTTPPasswordMgrWithDefaultRealm object. We add the username and password to it with add_password(), which gives us a Handler that handles validation.

Next, we use this Handler with the build_opener() method to build an Opener; requests sent with this Opener are, in effect, already validated.

Finally, we use the Opener's open() method to open the link and complete the validation. The result obtained here is the source content of the page after validation.

The proxy

When doing crawlers, it is inevitable to use proxies. If you want to add proxies, you can do this:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Here we set up a local proxy that runs on port 9743.

Here we use ProxyHandler, whose parameter is a dictionary: the keys are protocol types (such as http or https) and the values are proxy links; multiple proxies can be added.

We then use this Handler and the build_opener() method to construct an Opener, and send the request.
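If the proxy itself requires authentication, the credentials can usually be embedded directly in the proxy URL. A minimal sketch, with a hypothetical username, password, and proxy address:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# Hypothetical authenticated proxy in the form username:password@host:port
proxy_handler = ProxyHandler({
    'http': 'http://username:password@127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('http://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)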

Cookies

The processing of Cookies requires relevant handlers.

Let’s first use an example to see how to get the Cookies from the website. The relevant code is as follows:

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

First, we must declare a CookieJar object. Next, we use HTTPCookieProcessor to build a Handler. Finally, we build the Opener with the build_opener() method and call its open() method.

The running results are as follows:

BAIDUID=2E65A683F8A8BA3DF521469DF8EFF1E1:FG=1
BIDUPSID=2E65A683F8A8BA3DF521469DF8EFF1E1
H_PS_PSSID=20987_1421_18282_17949_21122_17001_21227_21189_21161_20927
PSTM=1474900615
BDSVRTM=0
BD_HOME=0

As you can see, the name and value of each Cookie are printed.

But since the Cookies can be printed, can they also be output to a file? We know that Cookies are actually stored in text form as well.

Of course the answer is yes; here’s an example:

import http.cookiejar, urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Here the CookieJar is replaced by MozillaCookieJar, which is needed when generating a file. It is a subclass of CookieJar that handles Cookies and file-related events, such as reading and saving Cookies, and it can save Cookies in the format used by the Mozilla browser.

After running, you can see that a cookies.txt file has been generated with the following contents:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file! Do not edit.

.baidu.com    TRUE    /    FALSE    3622386254    BAIDUID    05AE39B5F56C1DEC474325CDA522D44F:FG=1
.baidu.com    TRUE    /    FALSE    3622386254    BIDUPSID    05AE39B5F56C1DEC474325CDA522D44F
.baidu.com    TRUE    /    FALSE        H_PS_PSSID    19638_1453_17710_18240_21091_18560_17001_21191_21161
.baidu.com    TRUE    /    FALSE    3622386254    PSTM    1474902606
www.baidu.com    FALSE    /    FALSE        BDSVRTM    0
www.baidu.com    FALSE    /    FALSE        BD_HOME    0

In addition, LWPCookieJar can also read and save Cookies, but in a format different from MozillaCookieJar's: it saves Cookies in the libwww-perl (LWP) format.

To save Cookies in LWP format, you can change them to:

cookie = http.cookiejar.LWPCookieJar(filename)

The generated content is as follows:

# LWP - Cookies - 2.0
Set-Cookie3: BAIDUID="0CE9C56F598E69DB375B7C294AE5C591:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2084-10-14 18:25:19Z"; version=0
Set-Cookie3: BIDUPSID=0CE9C56F598E69DB375B7C294AE5C591; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2084-10-14 18:25:19Z"; version=0
Set-Cookie3: H_PS_PSSID=20048_1448_18240_17944_21089_21192_21161_20929; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1474902671; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2084-10-14 18:25:19Z"; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

From this point of view, the generated format is quite different.

So, how do you read and use the Cookies from the generated file?

Let’s take the LWPCookieJar format as an example:

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

As you can see, the load() method reads the local Cookies file and retrieves its contents. The premise is that we have first generated the Cookies file in LWPCookieJar format and saved it; after reading the Cookies, we construct the Handler and Opener in the same way as before to complete the operation.

If everything is normal, the source code of the Baidu home page will be output.
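Reading a file saved in MozillaCookieJar format works the same way; here is a minimal sketch, assuming cookies.txt was generated with MozillaCookieJar as shown earlier:

import http.cookiejar, urllib.request

cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))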

With the methods above, we can configure most of the request features we need.

This is the basic usage of the request module in the urllib library. If you want to implement more functionality, you can refer to the official documentation: docs.python.org/3/library/u… .


This article was first published on Cui Qingcai's personal blog: Python3 web crawler development practice tutorial.

For more crawler information, please follow my personal WeChat official account: Attack Coder
