Life is short, I use Python.

Previous posts in this series:

Learning Python crawler (1): The beginning

Learning Python crawler (2): Preparation (1) basic library installation

Learning Python crawler (3): Preparation (2) Linux basics

Learning Python crawler (4): Preparation (3) Docker basics

Learning Python crawler (5): Preparation (4) database basics

Learning Python crawler (6): Preparation (5) crawler framework installation

Learning Python crawler (7): HTTP basics

Learning Python crawler (8): Web basics

Learning Python crawler (9): Crawler basics

Learning Python crawler (10): Sessions and Cookies

Learning Python crawler (11): Urllib

Introduction

In the last article we covered the basic usage of urlopen(), but those few simple parameters are not enough to build a complete request. For more complex requests, such as adding request headers, we need the more powerful Request class.

Request

Official documentation: https://docs.python.org/zh-cn/3.7/library/urllib.request.html

Let’s look at the syntax for using Request:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

  • url: the URL to request. This is the only mandatory parameter; all the others are optional.
  • data: the request body. If supplied, it must be a bytes object.
  • headers: the request headers, a dictionary. Headers can be passed in when constructing the request or added afterwards by calling add_header() (see the sketch after this list).
  • origin_req_host: the host name or IP address of the party that originated the request.
  • unverifiable: whether the request is unverifiable; defaults to False. A request is unverifiable when the user has no option to approve it. For example, if an HTML document requests an image and the user has no way to approve the automatic loading of that image, unverifiable should be True.
  • method: the request method, such as GET, POST, PUT, or DELETE.
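Before the first full example, here is a minimal sketch of the add_header() route mentioned in the list above; the URL and User-Agent value are placeholders chosen for illustration:

import urllib.request

# Build a bare GET request first, then attach a header with add_header().
req = urllib.request.Request('https://httpbin.org/get')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)')

response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))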

Let’s start with a simple example of using Request to crawl a blog site:

import urllib.request

request = urllib.request.Request('https://www.geekdigging.com/')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

As you can see, we still use urlopen() to send the request, but instead of passing the URL, data, timeout and so on directly, we now pass a Request object.

Let’s build a slightly more complex request.

import urllib.request, urllib.parse
import json

url = 'https://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Content-Type': 'application/json; encoding=utf-8',
    'Host': 'geekdigging.com'
}
data = {
    'name': 'geekdigging',
    'hello': 'world'
}
data = bytes(json.dumps(data), encoding='utf8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
resp = urllib.request.urlopen(req)
print(resp.read().decode('utf-8'))

The results are as follows:

{ "args": {}, "data": "{\"name\": \"geekdigging\", \"hello\": \"world\"}", "files": {}, "form": {}, "headers": { "Accept-Encoding": "identity", "Content-Length": "41", "Content-Type": "application/json; Encoding = UTF-8 ", "Host": "geekdigging.com", "user-agent ": "Mozilla/5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"}, "json": {"hello": "world", "name": "Geekdigging}" and "origin", "116.234.254.11 116.234.254.11", "url" : "https://geekdigging.com/post"}Copy the code

Here we build a Request object with four parameters.

url specifies the address to access, the same test link used in the previous article.

User-Agent, Content-Type, and Host are specified in headers.

For data, json.dumps() converts the dict to a JSON string, and bytes() then turns that string into a byte stream.

Finally, the request method is specified as POST.

The output shows that all of our settings took effect.
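Incidentally, if the server expects an ordinary form POST rather than JSON, the byte payload can be built with urllib.parse.urlencode() instead of json.dumps(). A minimal sketch against the same httpbin.org test endpoint:

import urllib.request, urllib.parse

# A form POST: urlencode() builds 'name=geekdigging&hello=world',
# and encode() turns it into the bytes that Request requires.
data = urllib.parse.urlencode({'name': 'geekdigging', 'hello': 'world'}).encode('utf-8')
req = urllib.request.Request('https://httpbin.org/post', data=data, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))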

Advanced operations

Earlier we added request headers through Request, but to handle Cookies or access sites through a proxy we need the more powerful Handler. You can think of a Handler as a specialized processor; between them, Handlers can do almost everything related to HTTP requests for us.

urllib.request provides the BaseHandler class, which is the parent of all other Handlers. It provides the following methods and attributes for direct use:

  • add_parent(director): registers the given OpenerDirector as this handler's parent.
  • close(): removes any parents.
  • parent: an attribute holding the parent OpenerDirector, which can be used to open a URL with a different protocol or to handle errors.
  • default_open(): not defined in BaseHandler itself; subclasses define it to catch all URLs, and it is called before any protocol-specific open method.
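To see how a Handler plugs into this machinery, here is a minimal sketch, not from the original article, of a custom Handler that stamps a default User-Agent onto every HTTP/HTTPS request. The class name and User-Agent string are made up, and build_opener(), used again in the Cookie demos below, assembles the handler into an Opener:

import urllib.request

# A custom Handler sketch: OpenerDirector discovers hook methods by name,
# so defining http_request() makes this handler pre-process every HTTP
# request. The hook must return the (possibly modified) request.
class DefaultUserAgentHandler(urllib.request.BaseHandler):
    def http_request(self, req):
        if not req.has_header('User-agent'):
            req.add_header('User-agent', 'my-crawler/0.1')
        return req

    https_request = http_request  # HTTPS requests get the same treatment

opener = urllib.request.build_opener(DefaultUserAgentHandler())
response = opener.open('https://httpbin.org/get')
print(response.read().decode('utf-8'))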

Next, there are various Handler subclasses that inherit from the BaseHandler class:

  • HTTPDefaultErrorHandler: handles HTTP error responses by raising an exception of the HTTPError class.
  • HTTPRedirectHandler: handles redirects.
  • ProxyHandler: sets a proxy for requests; the default is no proxy.
  • HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.
  • AbstractBasicAuthHandler: handles authentication by retrieving the user/password pair and retrying the request.
  • HTTPBasicAuthHandler: retries the request with Basic authentication information.
  • HTTPCookieProcessor: handles Cookies.

urllib provides more BaseHandler subclasses than are listed here; see the official documentation for the full set.

Official documentation: https://docs.python.org/zh-cn/3.7/library/urllib.request.html#basehandler-objects
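To make a couple of these Handlers concrete before moving on, here is a hedged sketch of Basic authentication with HTTPBasicAuthHandler, jumping slightly ahead to build_opener(), which assembles Handlers into an Opener (explained in the next section). The credentials and URL point at httpbin.org's Basic-auth test endpoint, and the commented-out ProxyHandler lines assume a proxy you would supply yourself:

import urllib.request

# HTTP Basic authentication: the realm is None, so these credentials
# apply to any realm at the given URL (placeholder values that happen
# to match httpbin.org's test endpoint).
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'https://httpbin.org/basic-auth/user/passwd',
                          'user', 'passwd')
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# A ProxyHandler would be added the same way, e.g.:
# proxy_handler = urllib.request.ProxyHandler({'https': 'http://127.0.0.1:8080'})
# opener = urllib.request.build_opener(proxy_handler, auth_handler)
opener = urllib.request.build_opener(auth_handler)
response = opener.open('https://httpbin.org/basic-auth/user/passwd')
print(response.read().decode('utf-8'))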

Before I show you how to use Handler in earnest, let me introduce another advanced class: OpenerDirector.

OpenerDirector is a high-level class that opens URLs in three stages; within each stage, the order in which the methods are called is determined by sorting the handler instances:

  • First, every handler with a protocol_request() method has that method called to pre-process the request.
  • Then, handlers with a protocol_open() method are called to handle the request.
  • Finally, every handler with a protocol_response() method has that method called to post-process the response.

An OpenerDirector instance is commonly called an Opener. The urlopen() method we used before is in fact an Opener that urllib provides for us.

Opener’s methods include:

  • add_handler(handler): adds a Handler to the chain.
  • open(url, data=None[, timeout]): opens the given URL, just like the urlopen() method.
  • error(proto, *args): handles an error for the given protocol.
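One related convenience: urllib.request.install_opener() registers a custom Opener as the global default, so that subsequent plain urlopen() calls go through it. A minimal sketch, using the same Baidu URL as the demos below:

import urllib.request

# Build an Opener and call open() directly; the optional timeout works
# just like the one accepted by urlopen().
opener = urllib.request.build_opener()
response = opener.open('https://www.baidu.com/', timeout=10)
print(response.status)

# install_opener() makes this Opener the global default, so a plain
# urlopen() call now goes through the same handler chain.
urllib.request.install_opener(opener)
response = urllib.request.urlopen('https://www.baidu.com/')
print(response.status)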

Let’s demonstrate how to get Cookies from a website:

import http.cookiejar, urllib.request

# Instantiate a CookieJar object
cookie = http.cookiejar.CookieJar()
# Build a handler with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an Opener
opener = urllib.request.build_opener(handler)
# Send the request through the Opener
response = opener.open('https://www.baidu.com/')
print(cookie)
for item in cookie:
    print(item.name + " = " + item.value)

I won't walk through the code line by line; the comments cover it. The printed result is as follows:

<CookieJar[<Cookie BAIDUID=48EA1A60922D7A30F711A420D3C5BA22:FG=1 for .baidu.com/>, <Cookie BIDUPSID=48EA1A60922D7A30DA2E4CBE7B81D738 for .baidu.com/>, <Cookie PSTM=1575167484 for .baidu.com/>, <Cookie BD_NOT_HTTPS=1 for www.baidu.com/>]>
BAIDUID = 48EA1A60922D7A30F711A420D3C5BA22:FG=1
BIDUPSID = 48EA1A60922D7A30DA2E4CBE7B81D738
PSTM = 1575167484
BD_NOT_HTTPS = 1

This raises a question: since cookies can be printed, can we save the output of cookies to a file?

The answer is yes, of course, because we know that cookies themselves are stored in files.

# Example of saving Cookies in Mozilla format
filename = 'cookies_mozilla.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
print('cookies_mozilla saved successfully')

Here we replace CookieJar with MozillaCookieJar, which is used when generating files. It is a subclass of CookieJar that handles Cookies and file-related events, such as reading and saving Cookies, and it saves them in the format used by Mozilla browsers.

After running, we can see that a cookies_mozilla.txt file has been generated in the current program's directory, with the following contents:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com    TRUE    /    FALSE    1606703804    BAIDUID    0A7A76A3705A730B35A559B601425953:FG=1
.baidu.com    TRUE    /    FALSE    3722651451    BIDUPSID    0A7A76A3705A730BE64A1F6D826869B5
.baidu.com    TRUE    /    FALSE        H_PS_PSSID    1461_21102_30211_30125_26350_30239
.baidu.com    TRUE    /    FALSE    3722651451    PSTM    1575167805
.baidu.com    TRUE    /    FALSE        delPer    0
www.baidu.com    FALSE    /    FALSE        BDSVRTM    0
www.baidu.com    FALSE    /    FALSE        BD_HOME    0

(I'm lazy, so instead of a screenshot I pasted the results directly.)

Besides the Mozilla browser format, we can also save Cookies in the libwww-perl (LWP) format.

To save Cookies in LWP format, use LWPCookieJar when declaring the jar:

# Example of saving Cookies in LWP format
filename = 'cookies_lwp.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
print('cookies_lwp saved successfully')

The result is as follows:

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="D634D45523004545C6E23691E7CE3894:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2020-11-30 02:45:24Z"; comment=bd; version=0
Set-Cookie3: BIDUPSID=D634D455230045458E6056651566B7E3; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-12-19 05:59:31Z"; version=0
Set-Cookie3: H_PS_PSSID=1427_21095_30210_18560_30125; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1575168325; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-12-19 05:59:31Z"; version=0
Set-Cookie3: delPer=0; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

As you can see, the cookie file formats produced by the two types are quite different.

Now that the Cookie files have been generated, the next step is to read the Cookies and attach them to a request, as in this example:

# Request using the Mozilla-format Cookie file
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookies_mozilla.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

Here we use the load() method to read the local Cookies file and get the contents of the Cookies.

This assumes we have already generated the Mozilla-format Cookie file; after loading the Cookies, we build the Handler and Opener the same way as before.

When the request succeeds, it returns the source of the Baidu homepage; I won't paste the result here, as it really is a bit long.

That's the end of this article. I hope you remember to practice writing the code yourself.

Sample code

All of the code in this series will be available on Github and Gitee.

Example code -Github

Example code -Gitee

References

https://www.cnblogs.com/zhangxinqi/p/9170312.html

https://cuiqingcai.com/5500.html