
The following article is from Tencent Cloud by keinYe

urllib is Python's built-in standard library for making network requests. It contains several modules that handle URL-related functionality:

  • urllib.request is used to send requests and read URLs (including handling authentication, redirection, cookies, and so on), making it easy to fetch URL content.

  • urllib.error provides exception handling for urllib.request.

  • urllib.parse is used to parse URLs.

  • urllib.robotparser is used to parse robots.txt files.

urllib.request and urllib.error are the two modules we use most frequently in crawlers.

urllib.request

The urllib.request module lets you send HTTP requests and read the results.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

The parameters are as follows:

  • url is the web address to request; it can contain a domain name or an IP address.
  • data is the data to send to the server; omit it if there is nothing to send. data must be of type bytes and can be produced with the bytes() function.
  • timeout sets the request timeout period, in seconds.
  • cafile and capath specify the CA certificate file and the directory containing CA certificates; they may be needed when requesting HTTPS links.
  • cadefault is deprecated.
  • context must be of type ssl.SSLContext and specifies SSL settings.

Crawl web content

urllib.request.urlopen can be used to easily fetch web content. We take httpbin.org as an example to show how to use urlopen.
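A minimal sketch of such a request (https://httpbin.org/get is the test address used in this article; the rest is plain standard-library usage):

# -*- coding:utf-8 -*-
from urllib import request

# Send a GET request and read the whole response body
response = request.urlopen('https://httpbin.org/get')
print(response.status)                  # HTTP status code, e.g. 200
print(response.read().decode('utf-8'))  # the JSON text returned by httpbin.org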

Running this prints the JSON response returned by httpbin.org, which echoes back the request's arguments, headers, and origin address.

It is inevitable that a network request will occasionally hang without connecting. In that case you can set a timeout so that urlopen exits automatically when it cannot connect within a certain period, rather than stalling the whole program. We can set the timeout as shown below.
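A sketch of the same request with a timeout (the 5-second value is simply the figure used in this article):

# -*- coding:utf-8 -*-
from urllib import request

# Give up automatically if no connection is made within 5 seconds
response = request.urlopen('https://httpbin.org/get', timeout=5)
print(response.read().decode('utf-8'))

If the code above cannot connect within 5 seconds, urlopen stops waiting and raises an exception (typically urllib.error.URLError) instead of blocking the rest of the program.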

Submit data to the server

When submitting data to the server or requesting web pages that need to carry data, a POST request is required. In this case, you only need to pass the data as bytes to the data parameter.

The following is an example of submitting data using POST.
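A minimal sketch of such a POST request (https://httpbin.org/post is httpbin's POST test endpoint; the form fields below are placeholders chosen for illustration):

# -*- coding:utf-8 -*-
from urllib import request, parse

url = 'https://httpbin.org/post'
# urlencode the form fields, then convert the string to a byte stream with bytes()
data = bytes(parse.urlencode({'word': 'hello', 'count': '10'}), encoding='utf-8')
response = request.urlopen(url, data=data)
print(response.read().decode('utf-8'))

Because the data argument is present, urlopen sends the request as POST, and httpbin.org echoes the submitted fields back in the "form" section of its JSON response.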

Simulate a browser request

We used the urlopen method above to make simple GET and POST requests, but the few parameters that urlopen accepts are not enough to build a complete request, which usually also carries information such as headers. We can use the urllib.request.Request class to build requests that include headers and a request method. Here is the definition of urllib.request.Request:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

# -*- coding:utf-8 -*-
from urllib import request

url = 'https://httpbin.org/get'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
# Build a Request that carries the custom headers, then send it with urlopen
req = request.Request(url=url, headers=headers)
response = request.urlopen(req)
print(response.read().decode('utf-8'))

The results (only the relevant part of the response is shown):

{
  ...
  "headers": {
    ...
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
  },
  "origin": "119.137.3.11, 119.137.3.11",
  "url": "https://httpbin.org/get"
}

As you can see from the results, the data returned by httpbin.org contains the browser information (the User-Agent) that we submitted to it.

urllib.error

Network communication is an asynchronous process, and exceptions will inevitably occur. urllib.error is used to handle these errors (if an error is not handled, the program is interrupted), which makes the program more robust.

urllib.error has three exception classes: URLError, HTTPError, and ContentTooShortError.

URLError is the base class of the urllib.error exceptions and a subclass of OSError; it is raised when a request runs into an error. The URLError class has a reason attribute that gives the cause of the exception; reason is either a message string or an exception instance.

URLError sample code:
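A minimal sketch (the address below is a deliberately unreachable placeholder, used only to trigger the exception):

# -*- coding:utf-8 -*-
from urllib import request, error

try:
    # This host is assumed not to exist, so the request fails
    response = request.urlopen('https://nonexistent.example.invalid/', timeout=5)
    print(response.read().decode('utf-8'))
except error.URLError as e:
    # reason is either a message string or an exception instance
    print(e.reason)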

HTTPError is an exception class dedicated to handling HTTP and HTTPS request errors. An HTTPError instance can also be used as a file-like return value (the same kind of object that urlopen returns). HTTPError is a subclass of URLError and has three attributes: code, reason, and headers. code is the HTTP status code returned by the request, reason is the cause of the error, and headers is the response headers returned for the HTTP request.

HTTPError sample code:
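A minimal sketch (httpbin.org/status/404 simply returns an HTTP 404 status, which is enough to raise HTTPError; because HTTPError is a subclass of URLError, the more specific exception is caught first):

# -*- coding:utf-8 -*-
from urllib import request, error

try:
    response = request.urlopen('https://httpbin.org/status/404')
except error.HTTPError as e:
    # code is the HTTP status code, reason the cause of the error,
    # and headers the response headers returned by the server
    print(e.code)
    print(e.reason)
    print(e.headers)
except error.URLError as e:
    print(e.reason)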