Life is short, I use Python.

Previous articles in this series:

Python crawler (1): The beginning

Python crawler (2): Pre-preparation (1) basic library installation

Python crawler (3): Pre-preparation (2) Linux basics

Python crawler (4): Pre-preparation (3) Docker basics

Python crawler (5): Pre-preparation (4) database basics

Python crawler (6): Pre-preparation (5) crawler framework installation

Python crawler (7): HTTP basics

Python crawler (8): Web basics

Python crawler (9): Crawler basics

Python crawler (10): Session and Cookies

Python crawler (11): Urllib

Python crawler (12): Urllib

Python crawler (13): Urllib

Python crawler (14): Urllib

Python crawler (15): Urllib

Python crawler (16): Urllib

Introduction

In the earlier preparation articles we installed quite a few third-party request libraries, such as Requests, aiohttp and so on. I am not sure whether you still remember them; if not, you can look back through those articles.

In the previous articles we gained a rough understanding of the basic usage of urllib. It does have quite a few inconvenient spots: handling Cookies or accessing a site through a proxy, for example, requires Opener and Handler objects.

This is where the more powerful Requests library comes in. With Requests, these higher-order operations become much simpler.
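As a quick taste of that convenience, here is a minimal sketch of sending Cookies and routing the request through a proxy with nothing but keyword arguments, no Opener or Handler in sight. The cookie value and the proxy address are placeholders for illustration only:

import requests

# Cookies and proxies are plain keyword arguments in Requests.
cookies = {'token': 'test-token'}  # placeholder cookie, for illustration only
proxies = {
    'http': 'http://127.0.0.1:8888',   # placeholder proxy address
    'https': 'http://127.0.0.1:8888',  # replace with a proxy you actually have
}

r = requests.get('https://httpbin.org/get', cookies=cookies, proxies=proxies)
print(r.text)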

Introduction to Requests

First of all, the various official addresses:

  • GitHub:https://github.com/requests/requests
  • Official document: http://www.python-requests.org
  • Chinese document: http://docs.python-requests.org/zh_CN/latest

The various official links are listed here in the hope that you will get into the habit of consulting the official documentation. After all, the author is only human and can make mistakes; by comparison, the error rate of official documentation is very low, and sometimes even tricky problems can be solved by going through it.

We've already covered the basics back in the urllib articles, so let's skip the small talk and get straight down to business: writing code.

Here we use the same test address as previously mentioned: https://httpbin.org/.

GET requests

GET requests are the requests we use most often, so let's first take a look at how to send a GET request with Requests. The code is as follows:

import requests

r = requests.get('https://httpbin.org/get')
print(r.text)

The results are as follows:

{ "args": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "Python - requests / 2.22.0"}, "origin" : "116.234.254.11, 116.234.254.11", "url" : "https://httpbin.org/get"}Copy the code

There is not much to explain here; it is the same result we got with urllib.

If we want to add request parameters to a GET request, how do we add them?

import requests

params = {
    'name': 'geekdigging',
    'age': '18'
}

r1 = requests.get('https://httpbin.org/get', params=params)
print(r1.text)

The results are as follows:

{ "args": { "age": "18", "name": "geekdigging" }, "headers": { "Accept": "*/*", "Accept-Encoding": "Gzip, deflate", "Host": "httpbin.org"," user-agent ": "python-requests/2.22.0"}, "origin": "116.234.254.11, 116.234.254.11", "url": "https://httpbin.org/get?name=geekdigging&age=18"}Copy the code

As you can see, the requested link is automatically constructed as: https://httpbin.org/get?name=geekdigging&age=18.
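A nice side effect of params is that Requests also URL-encodes the values for us. A minimal sketch, with query values chosen purely for illustration:

import requests

# Spaces and non-ASCII characters in the values are percent-encoded automatically.
params = {'q': 'python 爬虫', 'page': 1}
r = requests.get('https://httpbin.org/get', params=params)
print(r.url)  # e.g. https://httpbin.org/get?q=python+%E7%88%AC%E8%99%AB&page=1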

It is important to note that r1.text returns data of type str, but the content is actually JSON. If you want to parse this JSON directly into a dictionary we can work with, you can do the following:

print(type(r1.text))
print(r1.json())
print(type(r1.json()))

The results are as follows:

<class 'str'> {'args': {'age': '18', 'name': 'geekdigging'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'user-agent ': 'python-requests/2.22.0'},' Origin ': '116.234.254.11, 116.234.254.11', 'URL ': 'https://httpbin.org/get?name=geekdigging&age=18'} <class 'dict'>Copy the code

Add a request header:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'referer': 'https://www.geekdigging.com/'
}
r2 = requests.get('https://httpbin.org/get', headers=headers)
print(r2.text)

The results are as follows:

{ "args": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "Referer": "Https://www.geekdigging.com/", "the user-agent: Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"}, "Origin ": "116.234.254.11 116.234.254.11,", "url" : "https://httpbin.org/get"}Copy the code

As with urllib.request, we pass the headers argument.

If we want to grab a picture or a video or something like that, what can we do?

These files are essentially binary data; it is only because each format has a specific way of being stored and a corresponding way of being parsed that we can see all this multimedia. So, if you want to grab them, you have to get their binary content.

For example, let's grab the Baidu logo. The image address is: https://www.baidu.com/img/superlogo_c4d7df0a003d3db9b65e9ef0fe6da1ec.png

import requests

r3 = requests.get("https://www.baidu.com/img/superlogo_c4d7df0a003d3db9b65e9ef0fe6da1ec.png")
with open('baidu_logo.png', 'wb') as f:
    f.write(r3.content)

The result will not be shown here; the image downloads normally.
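For large files it is better not to hold the whole body in memory at once. Requests can stream the response instead; here is a minimal sketch that reuses the logo URL above purely for illustration:

import requests

url = 'https://www.baidu.com/img/superlogo_c4d7df0a003d3db9b65e9ef0fe6da1ec.png'

# stream=True defers downloading the body until we iterate over it in chunks
with requests.get(url, stream=True) as r:
    with open('baidu_logo_stream.png', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)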

POST requests

Let's move on to POST requests, which are just as common. As with the GET request above, we still use https://httpbin.org/post for testing. Example code is as follows:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'referer': 'https://www.geekdigging.com/'
}
params = {
    'name': 'geekdigging',
    'age': '18'
}
r = requests.post('https://httpbin.org/post', data=params, headers=headers)
print(r.text)

The results are as follows:

{ "args": {}, "data": "", "files": {}, "form": { "age": "18", "name": "geekdigging" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "23", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "Referer": "https://www.geekdigging.com/", "User-Agent": "Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"}, "json": null, "origin": "116.234.254.11 116.234.254.11,", "url" : "https://httpbin.org/post"}Copy the code

Here we added both request headers and form parameters to the POST request; as you can see from the output, the data shows up in the form field.
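Form data is not the only option. If the interface you are calling expects a JSON body, Requests can serialize a dictionary for you through the json parameter; a minimal sketch, again against httpbin:

import requests

payload = {'name': 'geekdigging', 'age': '18'}

# json= serializes the dict and sets the Content-Type: application/json header automatically
r = requests.post('https://httpbin.org/post', json=payload)
print(r.json()['json'])  # httpbin echoes the parsed JSON body back under the 'json' key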

The Response object

Above we used text and json() to get the response content; besides those, there are many other properties and methods you can use to get other information.

Let's visit the Baidu home page to demonstrate:

import requests

r = requests.get('https://www.baidu.com')
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)

The results are as follows:

<class 'int'> 200 <class 'requests.structures.CaseInsensitiveDict'> {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Thu, 05 Dec 2019 13:24:11 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:23:55 GMT', Pragma': 'no-cache', 'Server': 'BFE /1.0.8.18',' set-cookie ': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'} <class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]> <class  'str'> https://www.baidu.com/ <class 'list'> []Copy the code

We print status_code to get the status code, headers to get the response headers, cookies to get the Cookies, url to get the URL, and history to get the request history.
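The status code is the quickest way to tell whether a request succeeded. A minimal sketch of two common checks: comparing against the built-in requests.codes lookup, and letting raise_for_status() throw on 4xx/5xx responses:

import requests

r = requests.get('https://www.baidu.com')

# Compare against the built-in status code lookup object
if r.status_code == requests.codes.ok:
    print('Request succeeded')

# Or have Requests raise requests.exceptions.HTTPError automatically on 4xx/5xx
r.raise_for_status()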

Sample code

All of the code in this series will be available on GitHub and Gitee.

Example code - GitHub

Example code -Gitee