😀 This is the 5th original article in the crawler column

We’ll start with a Python library called urllib, which allows us to send HTTP requests without worrying about the HTTP protocol itself or its lower-level implementation. urllib can also convert the response returned by the server into a Python object, from which we can easily obtain information about the response, such as the status code, response headers, response body, and so on.

Note: In Python 2, there were two libraries for sending requests: urllib and urllib2. In Python 3, urllib2 no longer exists; everything is unified under urllib. The official documentation link is: docs.python.org/3/library/u… .

First, let’s look at how to use the urllib library. It is Python’s built-in HTTP request library, meaning no additional installation is required. It contains the following four modules.

  • request: the most basic HTTP request module, which can be used to simulate sending a request. Just as you would type a URL into the browser and hit Enter, you can simulate this process by passing the URL and additional parameters to the library’s methods.
  • error: an exception handling module. If a request error occurs, we can catch the exception and retry or take other action to ensure that the program does not terminate unexpectedly.
  • parse: a tool module that provides many URL handling methods, such as splitting, parsing, and merging.
  • robotparser: mainly used to parse a website’s robots.txt file and determine which pages may and may not be crawled; it is used relatively rarely.

1. Send the request

Using urllib’s request module, we can easily send a request and get a response. Let’s look at how it’s used.

urlopen

The urllib.request module provides the most basic way to construct an HTTP request. With it we can simulate the process of a browser initiating a request; it also handles authorization verification, redirections, browser cookies, and more.

Let’s get a feel for how powerful it is, using the Python official site as an example:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

The running result is shown in the figure.

Figure Running results

Here we used only two lines of code to crawl the Python official website and output its source code. Think about it: once we have the source code, can’t we then extract the links, image addresses, and text we want?

Next, let’s see what it actually returns. Output the type of the response using the type method:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

The following output is displayed:

<class 'http.client.HTTPResponse'>

The output shows that the response is an object of type HTTPResponse, which mainly contains methods such as read, readinto, getheader, getheaders, and fileno, as well as attributes such as msg, version, status, reason, debuglevel, and closed.

Once we have this object and have assigned it to the response variable, we can call these methods and read these attributes to obtain various pieces of information about the returned result.

For example, calling the read method returns the content of the page, and the status attribute gives the status code of the result, such as 200 for a successful request or 404 for a page not found.

Here’s another example:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

The running results are as follows:

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Content-Length', '48775'), ('Accept-Ranges', 'bytes'), ('Date', 'Sun, 15 Mar 2020 13:29:01 GMT'), ('Via', '1.1 varnish'), ('Age', '708'), ('Connection', 'close'), ('X-Served-By', 'cache-bwi5120-BWI, cache-tyo19943-TYO'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '2, 518'), ('X-Timer', 'S1584278942.717942, VS0, VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

The last output gets the Server value in the response headers by calling the getheader method with the argument Server. The result is nginx, which means the server is built on Nginx.

With the most basic urlopen method, we can complete a simple GET request for a web page.

What if we want to pass some parameters along with the link? First, look at the urlopen method’s API:

urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)

As you can see, in addition to passing the URL as the first parameter, we can also pass other things, such as data, timeout, and so on.

We will explain the usage of these parameters in detail below.

The data parameter

The data argument is optional. If you want to add it, you need to use the bytes method to convert the parameter into byte-stream format, that is, the bytes type. In addition, if this parameter is passed, the request becomes a POST request instead of a GET request.

Here’s an example:

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'name': 'germey'}), encoding='utf-8')
response = urllib.request.urlopen('https://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

Here we pass a parameter name with the value germey. It needs to be converted into the bytes type. The bytes method takes a str (string) as its first parameter, so we use the urlencode method from the urllib.parse module to convert the parameter dictionary into a string; the second parameter specifies the encoding, in this case UTF-8.

The requested site is httpbin.org, which provides HTTP request testing. The URL we request this time is httpbin.org/post; this endpoint can be used to test POST requests. It outputs some information about the request, including the data parameter we passed.

The running results are as follows:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7",
    "X-Amzn-Trace-Id": "Root=1-5ed27e43-9eee361fec88b7d3ce9be9db"
  },
  "json": null,
  "origin": "17.220.233.154",
  "url": "https://httpbin.org/post"
}

The parameters we passed appear in the form field, indicating that the data was transferred as a POST request, mimicking a form submission.

The timeout parameter

The timeout parameter is used to set the timeout period, in seconds, meaning that if a request exceeds this specified time and no response is received, an exception will be thrown. If this parameter is not specified, the global default time is used. It supports HTTP, HTTPS, and FTP requests.

Here’s an example:

import urllib.request

response = urllib.request.urlopen('https://httpbin.org/get', timeout=0.1)
print(response.read())

The result might look like this:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/py/python/urllibtest.py", line 4, in <module>
    response = urllib.request.urlopen('https://httpbin.org/get', timeout=0.1)
  ...
urllib.error.URLError: <urlopen error _ssl.c:1059: The handshake operation timed out>

Here we set the timeout to 0.1 seconds. Since the server still has not responded 0.1 seconds after the program starts, a URLError exception is raised. This exception belongs to the urllib.error module, and its cause is a timeout.

Therefore, you can use this timeout to skip the crawl of a page when it does not respond for a long time. This can be implemented with a try...except statement; the related code is as follows:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

Here we request the httpbin.org/get test link, set the timeout to 0.1 seconds, catch URLError, and check whether the exception type is socket.timeout, i.e. a timeout exception. If so, the request indeed failed because of a timeout, and we print TIME OUT.

The running results are as follows:

TIME OUT

Generally speaking, it is almost impossible to get a response from the server within 0.1 seconds, so the TIME OUT prompt is printed.

It is sometimes useful to set timeout to handle timeouts.

The other parameters

In addition to the data and timeout parameters, there is also the context parameter, which must be of type ssl.SSLContext and is used to specify SSL settings.

In addition, the cafile and capath parameters specify the CA certificate and its path, respectively, which are useful when requesting HTTPS links.

The cadefault parameter is now deprecated and defaults to False.
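As a quick illustration (a minimal sketch added here, not one of the original examples), the context parameter can be used together with Python’s ssl module like this; the target URL is just the Python site used earlier:

import ssl
import urllib.request

# Build an SSL context; create_default_context loads the system's trusted CA
# certificates, so this is essentially the default verification made explicit.
context = ssl.create_default_context()

response = urllib.request.urlopen('https://www.python.org', context=context)
print(response.status)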

The urlopen method is a very basic way to perform simple requests and crawls. If you need more detail, see the official documentation: docs.python.org/3/library/u… .

Request

We know that the urlopen method can initiate the most basic requests, but these few simple parameters are not enough to build a complete request. If the request needs headers, for example, it can be built with the more powerful Request class.

First, let’s use an example to get a feel for the Request class:

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

As you can see, we still use the urlopen method to send the Request, but this time instead of a URL, the method takes an object of type Request. By constructing this data structure, we can isolate the request as an object on the one hand, and configure the parameters more richly and flexibly on the other.

Let’s take a look at how a Request can be constructed. It can be constructed as follows:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

The first parameter, url, is the URL to request; it is mandatory, while the other parameters are optional.

The second parameter, data, must be of type bytes if passed. If it is a dictionary, you can first encode it with urlencode() from the urllib.parse module.

The third parameter, headers, is a dictionary: the request headers. When constructing a request, we can either pass them directly via the headers parameter or add them by calling the add_header() method on the request instance.

The most common use of headers is to modify the User-Agent to disguise the request as a browser. The default User-Agent is Python-urllib; to disguise the request as Firefox, for example, you can set it to:

Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11

The fourth parameter origin_req_host refers to the host name or IP address of the requester.

The fifth parameter, unverifiable, indicates whether the request is unverifiable; the default is False, meaning the user does not have sufficient permission to choose to receive the result of the request. For example, if we request an image embedded in an HTML document but do not have permission to automatically fetch the image, then unverifiable is True.

The sixth argument, method, is a string that indicates the methods used in the request, such as GET, POST, and PUT.

Here we pass in multiple parameters to build the request:

from urllib import request, parse

url = 'https://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {'name': 'germey'}
data = bytes(parse.urlencode(dict), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Here we construct a request with four parameters: url is the request URL; headers specifies User-Agent and Host; data is converted to a byte stream with the urlencode and bytes methods; and the request method is specified as POST.

The running results are as follows:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "X-Amzn-Trace-Id": "Root=1-5ed27f77-884f503a2aa6760df7679f05"
  },
  "json": null,
  "origin": "17.220.233.154",
  "url": "https://httpbin.org/post"
}

You can see that data, headers, and method have all been set successfully.

Alternatively, headers can be added using the add_header method:

req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

This makes it much easier to construct and send requests.

Advanced usage

In the process above, we can construct requests, but what about more advanced operations, such as handling Cookies or setting proxies?

This is where a more powerful tool called Handler comes into play. In short, you can think of Handlers as various processors: some deal with login authentication, some with Cookies, and some with proxy settings. With them, we can do almost anything an HTTP request requires.

First, the BaseHandler class in the urllib.request module is the parent of all other Handlers. It provides the most basic methods, such as default_open, protocol_request, and so on.

Next, there are various Handler subclasses that inherit from BaseHandler; examples include the following.

  • HTTPDefaultErrorHandler: handles HTTP response errors, which are raised as HTTPError exceptions.
  • HTTPRedirectHandler: handles redirects.
  • HTTPCookieProcessor: handles Cookies.
  • ProxyHandler: sets a proxy; the default proxy is empty.
  • HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.
  • HTTPBasicAuthHandler: manages authentication; it can resolve the authentication needed when opening a link that requires it.

In addition, there are other Handler classes that are not listed here one by one; for details, refer to the official documentation: docs.python.org/3/library/u… .

There is no need to worry about how to use them, there will be examples later.

Another important class is OpenerDirector, which we can call Opener. We have used the urlopen method before, which is actually an Opener provided by urllib.

So why introduce Opener? Because we need to implement more advanced functionality. Request and urlopen are wrappers around very common request operations and can complete basic requests, but to implement more advanced features we need to go one layer deeper and configure and use lower-level instances. That is where Opener comes in.

Opener has an open method, and its return type is the same as that of urlopen. So what does it have to do with Handler? In short, Handlers are used to build Openers.
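To make the relationship concrete, here is a minimal sketch (an addition for illustration, not one of the original examples): build_opener returns an Opener whose open method behaves like urlopen, and install_opener can register it as the global default so that later urlopen calls go through it.

import urllib.request

# Build an Opener with no extra Handlers; it is an OpenerDirector instance
opener = urllib.request.build_opener()

# Its open method returns the same kind of response object as urlopen
response = opener.open('https://www.python.org')
print(response.status)

# Optionally install it globally so urllib.request.urlopen uses this Opener from now on
urllib.request.install_opener(opener)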

Here are a few examples to see how they are used.

Authentication

When visiting certain sites that have authentication enabled, such as https://ssr3.scrape.center/, we may encounter an authentication window like the one shown below:

Figure 2- Authentication window

If this happens, the site has enabled Basic Authentication (HTTP Basic Access Authentication), a form of login authentication that allows web browsers or other client programs to provide credentials in the form of a user name and password when making a request.

So what if you want to request such a page? This can be done with the help of the HTTPBasicAuthHandler:

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'admin'
password = 'admin'
url = 'https://ssr3.scrape.center/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

Here we first instantiate an HTTPBasicAuthHandler object, whose parameter is an HTTPPasswordMgrWithDefaultRealm object; we add the user name and password to it with the add_password method. This sets up a Handler that handles the authentication.

Next, we use this Handler with the build_opener method to build an Opener, so that requests sent with it are effectively already authenticated.

Finally, we use the Opener’s open method to open the link. The authentication is completed, and the result obtained is the source code of the page after authentication.

Proxies

When writing crawlers, using proxies is unavoidable. To add a proxy, you can do the following:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:8080',
    'https': 'https://127.0.0.1:8080'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Here, we need to set up an HTTP proxy locally and run it on port 8080.

Here we use ProxyHandler, whose parameter is a dictionary whose keys are protocol types (such as http or https) and whose values are proxy links; multiple proxies can be added.

Then we use this Handler and the build_opener method to construct an Opener and send the request.

Cookie

Handling Cookies requires the corresponding Handler.

Let’s first use an example to see how to grab a website’s Cookies; the relevant code is as follows:

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)

First, we declare a CookieJar object. Next, we use HTTPCookieProcessor to build a Handler, then use the build_opener method to build an Opener, and finally call the open method.

The running results are as follows:

BAIDUID=A09E6C4E38753531B9FB4C60CE9FDFCB:FG=1
BIDUPSID=A09E6C4E387535312F8AA46280C6C502
H_PS_PSSID=31358_1452_31325_21088_31110_31253_31605_31271_31463_30823
PSTM=1590854698
BDSVRTM=10
BD_HOME=1

As you can see, the name and value of each Cookie entry is printed.

Since the Cookies can be printed, can they also be saved to a file? We know that Cookies are, after all, stored as text.

Of course the answer is yes, here’s an example:

import urllib.request, http.cookiejar

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Here the CookieJar is replaced by MozillaCookieJar, which is used when generating the file. It is a subclass of CookieJar that handles Cookie- and file-related events, such as reading and saving Cookies, and it saves Cookies in the Mozilla browser Cookie format.

After running it, you can see that a cookie.txt file is generated with the following contents:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	1622390755	BAIDUID	0B4A68D74B0C0E53E5B82AFD9BF9178F:FG=1
.baidu.com	TRUE	/	FALSE	3738338402	BIDUPSID	0B4A68D74B0C0E53471FA6329280FA58
.baidu.com	TRUE	/	FALSE		H_PS_PSSID	31262_1438_31325_21127_31110_31596_31673_31464_30823_26350
.baidu.com	TRUE	/	FALSE	3738338402	PSTM	1590854754
www.baidu.com	FALSE	/	FALSE		BDSVRTM	0
www.baidu.com	FALSE	/	FALSE		BD_HOME	1

In addition, LWPCookieJar can also read and save Cookies, but in a different format from MozillaCookieJar: it saves Cookies in the libwww-perl (LWP) format.

To save the Cookie file in LWP format, you can change it to:

cookie = http.cookiejar.LWPCookieJar(filename)

The generated content is as follows:

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="1F30EEDA35C7A94320275F991CA5B3A5:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2021-05-30 16:06:39Z"; comment=bd; version=0
Set-Cookie3: BIDUPSID=1F30EEDA35C7A9433C97CF6245CBC383; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2088-06-17 19:20:46Z"; version=0
Set-Cookie3: H_PS_PSSID=31626_1440_21124_31069_31254_31594_30841_31673_31464_31715_30823; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1590854799; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2088-06-17 19:20:46Z"; version=0
Set-Cookie3: BDSVRTM=11; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=1; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

From this point of view, the generated format is quite different.

So, once a Cookie file is generated, how can it be read from the file and exploited?

Let’s take the LWPCookieJar format as an example:

import urllib.request, http.cookiejar

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))
Copy the code

As you can see, the load method is called to read the local Cookie file and obtain the Cookies’ contents. The prerequisite is that we have already generated and saved the Cookies in LWPCookieJar format; after reading them, we construct the Handler and Opener in the same way as before to complete the operation.

If the results are normal, it will output the source code of Baidu web pages.

With the methods above, we can configure most of the features we need for requests.

This covers the basic usage of the request module in the urllib library. If you want to implement more functionality, you can refer to the official documentation: docs.python.org/3/library/u… .

2. Handle exceptions

In the previous section, we saw how to send requests, but what if an exception occurs, for example when the network is bad? If these exceptions are not handled, the program is likely to terminate with an error, so exception handling is very necessary.

urllib’s error module defines the exceptions raised by the request module. If there is a problem, the request module throws an exception defined in the error module.

URLError

The URLError class comes from urllib’s error module. It inherits from OSError and is the base class of the error module’s exceptions; exceptions generated by the request module can be handled by catching this class.

It has one attribute, reason, which returns the reason for the error.

Here’s an example:

from urllib import request, error

try:
    response = request.urlopen('https://cuiqingcai.com/404')
except error.URLError as e:
    print(e.reason)

We opened a page that doesn’t exist; this should report an error, but instead we caught the URLError exception. The running result is as follows:

Not Found

The program does not report the error directly but prints the text above; this way the program avoids terminating abnormally and the exception is handled effectively.

HTTPError

It is a subclass of URLError and is designed to handle HTTP request errors, such as authentication request failures. It has the following three properties.

  • code: Returns the HTTP status code. For example, 404 indicates that the web page does not exist, and 500 indicates that the server has an internal error.
  • reason: as with the parent class, used to return the cause of the error.
  • headers: returns the response headers.

Here is an example:

from urllib import request, error

try:
    response = request.urlopen('https://cuiqingcai.com/404')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')

The running results are as follows:

Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Sat, 30 May 2020 16:08:42 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: PHPSESSID=kp1a1b0o3a0pcf688kt73gc780; path=/
Pragma: no-cache
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"

Here we request the same URL, catch the HTTPError exception, and output its reason, code, and headers attributes.

Since URLError is the parent class of HTTPError, we can choose to catch the subclass error first and then the parent class error, so a better way to write this code is as follows:

from urllib import request, error

try:
    response = request.urlopen('https://cuiqingcai.com/404')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
Copy the code

This way we first catch HTTPError and obtain its error reason, status code, headers, and so on. If the exception is not an HTTPError, the URLError exception is caught and the reason for the error is printed. Finally, else handles the normal logic. This is a better way to write exception handling.

Sometimes the reason attribute returns not a string but an object. Take a look at the following example:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

Here we simply set the timeout to force a timeout exception to be thrown.

The running results are as follows:

<class 'socket.timeout'>
TIME OUT

As you can see, the reason attribute here is an instance of the socket.timeout class. So we can use the isinstance method to check its type and make a more fine-grained exception judgment.

In this section, we describe the use of the error module. By properly capturing exceptions, we can make more accurate exception judgments and make programs more robust.

3. Parse links

As mentioned earlier, the urllib library also provides the parse module, which defines a standard interface for working with URLs, such as extracting, merging, and transforming their parts. It supports URL handling for the following protocols: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, sip, sips, snews, svn, svn+ssh, telnet, and wais. In this section, we look at the module’s common methods to see how convenient they are.

urlparse

This method implements URL recognition and splitting. Here is an example:

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)

Here we use the urlparse method to parse a URL. First the type of the parse result is printed, and then the result itself.

The running results are as follows:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

As you can see, the result is an object of type ParseResult with six parts: scheme, netloc, path, params, query, and fragment.

Take a look at the URL of this instance:

https://www.baidu.com/index.html;user?id=5#comment

As you can see, the urlparse method splits the URL into six parts. A quick look shows that parsing relies on specific delimiters: what comes before :// is the scheme, representing the protocol; what comes before the first / is the netloc, i.e. the domain name, and what follows it is the path, i.e. the access path; the semicolon ; introduces the params, i.e. the parameters; the question mark ? introduces the query conditions, commonly used in GET-type URLs; and the hash # introduces the fragment, an anchor used to jump directly to a position within the page.

Therefore, a standard link format can be obtained as follows:

scheme://netloc/path;params?query#fragment

A standard URL conforms to this rule, and the urlparse method can split it apart.

Besides this most basic parsing, does the urlparse method have any other configuration? Let’s look at its API:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

As you can see, it takes three arguments.

  • urlstring: This is mandatory, that is, the URL to be parsed.
  • scheme: the default protocol (e.g. http or https). If the link carries no scheme information, this value is used as the default. Let’s look at an example:
from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

The running results are as follows:

ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

As you can see, the URL we provided does not begin with scheme information, but thanks to the default scheme argument, https is returned as the scheme.

Now suppose the URL carries a scheme:

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')

The results are as follows:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

The scheme parameter takes effect only when the URL does not contain Scheme information. If the URL contains scheme information, the resolved scheme is returned.

  • allow_fragments: whether to ignore the fragment. If it is set to False, the fragment part is ignored and parsed as part of the path, params, or query, while the fragment in the result is empty.

Here’s an example:

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

The running results are as follows:

ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

Assuming that the URL does not contain params and query, let’s look at an example:

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)

The running results are as follows:

ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

As you can see, when the URL does not contain Params and Query, the fragment is resolved as part of the Path.

The result ParseResult is actually a tuple that can be retrieved either by index order or by attribute name. The following is an example:

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html#comment', allow_fragments=False)
print(result.scheme, result[0], result.netloc, result[1], sep='\n')

Here we obtain the scheme and netloc both by attribute name and by index. The result is as follows:

https
https
www.baidu.com
www.baidu.com

As you can see, the two approaches produce identical results; either one works.

urlunparse

Since there is a urlparse method, there is naturally its counterpart, urlunparse. The argument it takes is an iterable whose length must be 6; otherwise it will raise an error complaining of too few or too many arguments. Let’s look at an example:

from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Here the parameter data uses a list type. Of course, you can also use other types, such as tuples or specific data structures.

The running results are as follows:

https://www.baidu.com/index.html;user?a=6#comment

Thus we have successfully constructed a URL.

urlsplit

This method is very similar to urlparse, except that it no longer parses the params part separately and returns only five results; the params from the example above are merged into path. Here is an example:

from urllib.parse import urlsplit

result = urlsplit('https://www.baidu.com/index.html;user?id=5#comment')
print(result)

The running results are as follows:

SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

As you can see, the return result is SplitResult, which is also a tuple type that can be retrieved using either an attribute or an index. The following is an example:

from urllib.parse import urlsplit

result = urlsplit('https://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result[0])

The running results are as follows:

https https

urlunsplit

Like urlunparse, this method combines the parts of a link into a complete link. It takes an iterable argument such as a list or tuple, the only difference being that the length must be 5. Here is an example:

from urllib.parse import urlunsplit

data = ['https', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))

The running results are as follows:

https://www.baidu.com/index.html?a=6#comment

urljoin

With the urlunparse and urlunsplit methods, we can merge links, but only if we have objects of a specific length in which each part of the link is clearly separated.

There is another way to generate links: the urljoin method. We provide a base_url (base link) as the first argument and the new link as the second argument. This method analyzes the scheme, netloc, and path of base_url, uses them to fill in the missing parts of the new link, and returns the result.

Here are a few examples:

from urllib.parse import urljoin

print(urljoin('https://www.baidu.com', 'FAQ.html'))
print(urljoin('https://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('https://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('https://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('https://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('https://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

The running results are as follows:

https://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
https://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

As you can see, base_url provides three items: scheme, netloc, and path. If any of these is missing from the new link, it is filled in from base_url; if the new link already has it, the new link’s own part is used. The params, query, and fragment of base_url have no effect.

With the urljoin method, we can easily parse, assemble, and generate links.

urlencode

Here we introduce another common method, urlencode, which is useful for constructing GET request parameters, as shown in the following example:

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 25
}
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

You declare a dictionary to represent the parameters, and then call the urlencode method to serialize them as GET request parameters.

The running results are as follows:

https://www.baidu.com?name=germey&age=25

As you can see, the parameter was successfully converted from a dictionary type to a GET request parameter.

This method is very common. Sometimes, to make it easier to construct parameters, we will use dictionaries in advance. To convert to a URL parameter, you simply call this method.

parse_qs

With serialization, there must be deserialization. If we have a string of GET request parameters, we can use the parse_qs method to convert it back into a dictionary, as shown in the following example:

from urllib.parse import parse_qs

query = 'name=germey&age=25'
print(parse_qs(query))

The running results are as follows:

{'name': ['germey'], 'age': ['25']}

As you can see, this successfully converts back to the dictionary type.

parse_qsl

In addition, there is a parse_qsl method that converts parameters to a list of tuples, as shown in the following example:

from urllib.parse import parse_qsl

query = 'name=germey&age=25'
print(parse_qsl(query))

The running results are as follows:

[('name', 'germey'), ('age', '25')]

As you can see, the result is a list, and each element in the list is a tuple whose first content is the parameter name and second content is the parameter value.

quote

This method converts content into URL-encoded (percent-encoded) format. When a URL contains Chinese parameters, it can sometimes cause garbled characters; in such cases we can use this method to convert Chinese characters into URL encoding, as shown in the following example:

from urllib.parse import quote

keyword = '壁纸'  # Chinese for "wallpaper"
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Here we declare a Chinese search keyword and then URL-encode it with the quote method. The final result is as follows:

https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

unquote

Along with the quote method there is, of course, the unquote method, which can decode URLs, as shown in the following example:

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))

Here we take the URL-encoded result obtained above and restore it with the unquote method. The result is as follows:

https://www.baidu.com/s?wd=壁纸

As you can see, decoding can be done easily with the unquote method.

In this section, we introduced the common URL handling methods of the parse module. With these methods, we can easily parse and construct URLs; they are well worth mastering.

4. Analyze the Robots protocol

Using urllib’s robotparser module, we can analyze a website’s Robots protocol. In this section, we take a brief look at how to use this module.

Robots Protocol

The Robots protocol, also known as the crawler protocol or robot protocol, is formally called the Robots Exclusion Protocol. It is used to tell crawlers and search engines which pages may be crawled and which may not. It usually takes the form of a text file called robots.txt placed in the root directory of a website.

When a search crawler visits a site, it first checks whether a robots.txt file exists in the site’s root directory. If it does, the crawler crawls according to the crawl scope defined in it; if the file is not found, the crawler visits all directly accessible pages.

Let’s look at an example of robots.txt:

User-agent: *
Disallow: /
Allow: /public/

This restricts all search crawlers to crawling only the public directory. Save the content above as a robots.txt file in the root directory of the site, alongside the site’s entry files such as index.php, index.html, or index.jsp.

The User-agent line describes the name of the search crawler; setting it to * here means the protocol applies to any crawler. For example, we could set:

User-agent: Baiduspider

This means the rules we set apply to Baidu’s crawler. If there are multiple User-agent records, multiple crawlers are subject to the crawl restrictions, but at least one must be specified.

Disallow specifies the directories that must not be crawled. For example, setting it to / in the example above means that no pages may be crawled.

Allow is generally used together with Disallow, not on its own, to exclude certain pages from a restriction. In the example above we set it to /public/, meaning that although nothing else may be crawled, the public directory may be.

Let’s look at some more examples. The code forbidding all crawlers from accessing any directory is as follows:

User-agent: *
Disallow: /

The code that allows all crawlers to access any directory is as follows:

User-agent: *
Disallow:

It is also possible to leave the robots.txt file blank.

The code that prevents all crawlers from accessing certain directories on the site is as follows:

User-agent: *
Disallow: /private/
Disallow: /tmp/

The code that allows only one crawler to access the site is as follows:

User-agent: WebCrawler
Disallow:
User-agent: *
Disallow: /

These are some common ways to write robots.txt.

Crawler names

You might wonder where a crawler’s name comes from and why it is called that. In fact, each crawler has a fixed name; Baidu’s, for example, is called BaiduSpider. Table 2- lists the names of some common search crawlers and their corresponding websites.

Table of common search crawler names and their corresponding websites

Crawler name    Search engine    Website
BaiduSpider     Baidu            www.baidu.com
Googlebot       Google           www.google.com
360Spider       360 Search       www.so.com
YodaoBot        Youdao           www.youdao.com
ia_archiver     Alexa            www.alexa.cn
Scooter         AltaVista        www.altavista.com
Bingbot         Bing             www.bing.com

robotparser

Once we understand the Robots protocol, we can use the robotparser module to parse robots.txt. The module provides a single class, RobotFileParser, which can determine whether a crawler has permission to crawl a given page based on a site’s robots.txt file.

The class is very simple to use: just pass the robots.txt link to the constructor. First, take a look at its declaration:

urllib.robotparser.RobotFileParser(url='')

You can also declare it without passing a URL (it defaults to empty) and set the link later with the set_url method.

The following lists several methods that are commonly used by this class.

  • set_url: sets the link to the robots.txt file. If you passed the link when creating the RobotFileParser object, you do not need to use this method.
  • read: reads the robots.txt file and analyzes it. Note that this method performs the actual reading and parsing; if it is not called, all subsequent judgments will be False, so remember to call it. It does not return anything, but the read is performed.
  • parse: parses a robots.txt file. The argument passed in is the content of robots.txt as lines, which it analyzes according to the robots.txt syntax rules.
  • can_fetch: takes two arguments, the first a User-Agent and the second a URL to fetch. It returns True or False, indicating whether that search engine may crawl the URL.
  • mtime: returns the time robots.txt was last fetched and analyzed. This is useful for long-running crawlers, which may need to check regularly for the latest robots.txt (see the short sketch after the example below).
  • modified: also helpful for long-running crawlers; it sets the time robots.txt was last fetched and analyzed to the current time.

Here’s an example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com'))
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/homepage/'))
print(rp.can_fetch('Googlebot', 'https://www.baidu.com/homepage/'))

Taking Baidu as an example, we first create the RobotFileParser object and then set the robots.txt link with the set_url method. Of course, instead of using that method, you can pass the link directly when declaring the object:

rp = RobotFileParser('https://www.baidu.com/robots.txt')

Then the can_fetch method is used to determine whether the web page can be fetched.

The running results are as follows:

True
True
False
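As a rough illustration of the mtime and modified methods listed above, here is a hedged sketch of how a long-running crawler might refresh robots.txt periodically; the helper function and the one-hour interval are assumptions made up for this example:

import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.baidu.com/robots.txt')
rp.read()
rp.modified()  # record the current time as the last fetch-and-parse time

def allowed(url, user_agent='Baiduspider', max_age=3600):
    # Re-fetch robots.txt if our cached copy is older than max_age seconds
    if time.time() - rp.mtime() > max_age:
        rp.read()
        rp.modified()
    return rp.can_fetch(user_agent, url)

print(allowed('https://www.baidu.com/homepage/'))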

As you can see, Baiduspider may crawl Baidu’s homepage and the /homepage/ page, but Googlebot may not crawl the /homepage/ page.

If we open Baidu’s robots.txt file, we can see the following content:

User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh

As you can see, robots.txt does not forbid Baiduspider from crawling the /homepage/ page, but it does forbid Googlebot from doing so.

We can also read and parse robots.txt with the parse method, as shown in the following example:

from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(urlopen('https://www.baidu.com/robots.txt').read().decode('utf-8').split('\n'))
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com'))
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/homepage/'))
print(rp.can_fetch('Googlebot', 'https://www.baidu.com/homepage/'))

The result is the same:

True
True
False

This section introduced the basic usage of the robotparser module with examples. Using it, we can easily determine which pages may be crawled and which may not.

5. To summarize

In this section, we introduced the basic usage of urllib’s request, error, parse, and robotparser modules. These are fairly basic modules, but some of them are very useful; for example, we can use the parse module to perform all kinds of URL processing.

This section code: github.com/Python3WebS… .

Thank you very much for reading. For more exciting content, please pay attention to my public account “Attack Coder” and “Cui Qingcai | Jingmi”.