This is the 24th day of my participation in the Gengwen Challenge.

“If something is important enough, even if the odds are against you, you should still do it.” – Elon Musk

This is the first article in my Python crawler series, and it starts with the Requests library. Why Requests first? Python crawlers typically use Requests together with other libraries to crawl, or use a mainstream crawler framework such as Scrapy, which we will get to later. Requests is also very beginner friendly: although asynchronous coroutines may come up later, Requests plus a single thread or multiple threads is still the first choice for beginners.

It should also be noted that crawlers are not hard to learn, but front-end knowledge such as HTML and an understanding of how networks work matter slightly more than Python itself. In a nutshell, we need to figure out how crawlers work.

No.1 Preparing the Python environment

The first step is to install the Python environment.

Use PyCharm or Anaconda's Spyder; you can install either of these two IDEs. They were introduced in an earlier article.

Next, install the Requests library: open a CMD window and type pip install requests. Installing third-party libraries will be covered in more detail later.

You can start by writing a small program to check if the above steps have been successfully completed:

Get ("http://www.baidu.com") print(r.tatus_code)# print(type(r))#r Print (r.haaders)# return the header of the get request pageCopy the code

If the output looks like the following, your Python environment and the library installation are fine (assuming a working network connection):

Note (status code): r.status_code is the return status of the HTTP request; 200 indicates success, while 404 or other codes indicate failure.

No.2 About the Requests library

Let’s start with seven main ways to use the Requests library:

requests.request(): the basic method that constructs a request; it underlies each of the methods below.
requests.get(): the main method for fetching HTML pages, corresponding to HTTP GET.
requests.head(): gets the headers of an HTML page, corresponding to HTTP HEAD.
requests.post(): submits a POST request to an HTML page, corresponding to HTTP POST.
requests.put(): submits a PUT request to an HTML page, corresponding to HTTP PUT.
requests.patch(): submits a partial modification request to an HTML page, corresponding to HTTP PATCH.
requests.delete(): submits a delete request to an HTML page, corresponding to HTTP DELETE.

1. The Get method

r = requests.get(url) constructs a Request object that asks the server for a resource and returns a Response object containing the server's resource.

Form of use:

requests.get(url, params=None, **kwargs)
1. url: the URL of the page to fetch.
2. params: extra parameters appended to the URL, as a dictionary or byte stream; optional.
3. **kwargs: 12 parameters that control access; optional.
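For example, the params argument appends key-value pairs to the URL as a query string. A quick sketch (the same URL appears in a later requests.request example):

import requests

kv={'key1':'value1','key2':'value2'}
r=requests.get('http://python123.io/ws',params=kv)
print(r.url)   # the request URL now carries the key-value pairs as a query string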

Note: r = requests.get(url) involves two important objects from the Requests library: the Response object r, which contains the content returned by the crawler, and the Request object constructed by get().

Speaking of Response, we have to mention the attributes of the Response object:

r.status_code: the return status of the HTTP request; 200 means the connection succeeded, 404 or another code means it failed.
r.text: the HTTP response body as text, i.e. the page content corresponding to the URL.
r.encoding: the response encoding guessed from the HTTP header.
r.apparent_encoding: the encoding deduced from the response content itself (a fallback).
r.content: the HTTP response body in binary form.

And the encoding of the Response object:

r.encoding: if there is no charset in the header, the encoding is assumed to be ISO-8859-1.

r.apparent_encoding: the encoding deduced by analysing the page content.

For example, when fetching a page with Requests, we first check whether r.status_code is 200; if so, we can look at r.text or r.encoding. If it is 404 or something else, an error or exception occurred for some reason.
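A common defensive pattern (a sketch of my own, not from the original text) wraps this check in try/except and uses r.raise_for_status(), which raises an HTTPError for 4xx/5xx responses:

import requests

def get_html_text(url):
    # Return the page text, or a short message if the request fails.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()                # raises HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding    # use the encoding deduced from the content
        return r.text
    except requests.RequestException:
        return "An exception occurred"

print(get_html_text("http://www.baidu.com")[:200])   # print the first 200 characters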

At the console, try the following statement:

r=requests.get("http://www.baidu.com")
r.status_code
r.text
r.encoding
r.apparent_encoding
r.encoding='utf-8'   # substitute an alternative encoding
r.text

The console output is as follows:

Note: in the interactive console you do not need print statements; results are echoed automatically.

2. The Head method

r = requests.head('http://bob0912.github.io') fetches only the response headers; use r.headers to read them. At the console, try the following:

r=requests.get("http://www.baidu.com")
r.status_code
r.headers
Copy the code

The output is as follows:

3. The Post method

POST: posting a dictionary to a URL automatically encodes it as a form, as in the following statement:

payload={'key1':'value1','key2':'value2'}
r=requests.post('http://bob0912.github.io',data=payload)
print(r.text)
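As a side note (a sketch of my own, using the public echo service httpbin.org, which simply returns what it receives): a dictionary is form-encoded, while a plain string is sent as the raw request body.

import requests

# A dictionary is encoded as a form and echoed back under "form".
payload={'key1':'value1','key2':'value2'}
r=requests.post('http://httpbin.org/post',data=payload)
print(r.json()['form'])

# A plain string is sent as the raw body and echoed back under "data".
r=requests.post('http://httpbin.org/post',data='hello')
print(r.json()['data'])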

4. The Request method

Form of use:

requests.request(method,url,**kwargs)

1. method: the request method, such as GET, PUT, POST, etc.

2. url: the URL of the page to fetch.

3. **kwargs: parameters that control access; 13 in total.

The seven ways to use the request method:

r=requests.request('GET',url,**kwargs)
r=requests.request('HEAD',url,**kwargs)
r=requests.request('POST',url,**kwargs)
r=requests.request('PUT',url,**kwargs)
r=requests.request('PATCH',url,**kwargs)
r=requests.request('DELETE',url,**kwargs)
r=requests.request('OPTIONS',url,**kwargs)

**kwargs: parameters that control access; all are optional.

Params: dictionary or sequence of bytes added as arguments to a URL, as in:

kv={'key1':'value1','key2':'value2'}
r=requests.request('GET','http://python123.io/ws',params=kv)
print(r.url)

data: a dictionary, byte sequence, or file object, used as the content of the Request, for example:

kv={'key1':'value1','key2':'value2'}
r=requests.request('POST','http://python123.io/ws',data=kv)

Json: Data in JSON format, used as the content of the Request.

kv={'key1':'value1'}
r=requests.request('POST','http://python123.io/ws',json=kv)

Headers: dictionary, HTTP custom headers.

hd={'user-agent':'chrome/10'}
r=requests.request('POST','http://bob0912.github.io',headers=hd)

Cookies: dictionary or CookieJar, cookies in Request. Auth: a tuple that supports HTTP authentication. Both of these are advanced features.
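A minimal sketch of both (the URL and credentials below are placeholders, not from the original article):

import requests

# cookies: sent with the request as a dictionary
cookies={'session_id':'abc123'}
r=requests.request('GET','http://bob0912.github.io',cookies=cookies)

# auth: a (user, password) tuple for HTTP Basic authentication
r=requests.request('GET','http://bob0912.github.io',auth=('user','pass'))
print(r.status_code)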

Files: dictionary type, for uploading files.

fs={'file':open('data.xls','rb')}
r=requests.request('POST','http://bob0912.github.io',files=fs)

Timeout: Indicates the timeout period, in seconds.

r=requests.request('GET','http://bob0912.github.io',timeout=10)
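If the server does not respond within the time limit, Requests raises a Timeout exception that can be caught. A short sketch (the tiny timeout is only there to force the exception):

import requests

try:
    r=requests.request('GET','http://bob0912.github.io',timeout=0.01)
except requests.exceptions.Timeout:
    print('The request timed out')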

Proxies: dictionary type; sets proxy servers for access, and can include login authentication.

pxs={'http':'http://user:[email protected]:1234',
     'https':'https://10.10.10.1:4321'}
r=requests.request('GET','http://www.baidu.com',proxies=pxs)

In this way, when we visit Baidu, the IP address it sees is that of the proxy server, which effectively hides our own IP address.

Advanced features:

allow_redirects: True/False, default True; switch for following redirects.
stream: True/False, default False; when False the response body is downloaded immediately, when True it is streamed.
verify: True/False, default True; switch for verifying the SSL certificate.
cert: path to a local SSL certificate.
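A sketch of how these switches are passed (the values shown are the library defaults; cert is commented out because it needs a real certificate path):

import requests

r=requests.request(
    'GET',
    'http://bob0912.github.io',
    allow_redirects=True,           # follow redirects (default)
    stream=False,                   # download the response body immediately (default)
    verify=True,                    # verify the SSL certificate (default)
    # cert='/path/to/client.pem',   # local SSL client certificate, if required
)
print(r.status_code)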

Python crawler series, to be continued…