😀 This is the sixth article in the original crawler column

In the previous section, we looked at the basic usage of urllib, but it has some inconveniences: handling web page authentication and Cookies requires writing Opener and Handler objects, and implementing requests such as POST and PUT is not very convenient either.

To make these tasks easier, we have the more powerful requests library, with which Cookies, login authentication, proxy settings, and so on become much simpler to handle.

Here’s a taste of how powerful it can be.

1. Preparation

Before you begin, make sure you have the requests library installed. If not, you can install it with pip3:

pip3 install requests

For more detailed installation instructions, refer to setup.scrape.center/requests.

2. Instance introduction

The urlopen method in urllib actually requests web pages with the GET method; the corresponding method in requests is get. Here’s an example:

import requests

r = requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text[:100])
print(r.cookies)

The running results are as follows:

<class 'requests.models.Response'>
200
<class 'str'>
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charse
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

Here, we call the get method to perform the same operation as urlopen, obtaining a Response object, and then print the type of the response, the status code, the type of the response body, the body content, and the Cookies.

The result shows that the return type is requests.models.Response, the response body is of type str, and the Cookies are of type RequestsCookieJar.

Using the get method to send a GET request is no big deal. More conveniently, other request types can also be sent with a single line each, as shown in the following example:

import requests

r = requests.get('https://httpbin.org/get')
r = requests.post('https://httpbin.org/post')
r = requests.put('https://httpbin.org/put')
r = requests.delete('https://httpbin.org/delete')
r = requests.patch('https://httpbin.org/patch')

POST, PUT, DELETE and other requests are implemented with the post, put, delete and other methods. Isn’t that much simpler than urllib?

That’s just the tip of the iceberg. There’s more to come.

3. GET request

One of the most common requests in HTTP is a GET request, so let’s take a closer look at how to build a GET request with requests.

Basic example

First, build a simple GET request with a link to httpbin.org/get. The site will determine if the client initiated a GET request and return the corresponding request information:

import requests

r = requests.get('https://httpbin.org/get')
print(r.text)

The running results are as follows:

{ "args": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "Python-requests /2.22.0"," x-amzn-trace-id ": "Root= 1-5e6e3a2E-6b1a28288d721c9e425a462A "}, "origin": "17.20.233.237", "url" : "https://httpbin.org/get"}Copy the code

It can be found that we successfully initiated a GET request, and the return result contains the request header, URL, IP and other information.

So, for GET requests, if you want to attach additional information, how do you typically do that? For example, if you want to add two parameters, where name is germey and age is 25, you can write the URL as follows:

https://httpbin.org/get?name=germey&age=25

To construct the request link, should I write it like this?

r = requests.get('https://httpbin.org/get?name=germey&age=25')

That works, but isn’t it a bit clumsy? The parameters have to be concatenated into the URL by hand, which is not very elegant.

In general, this information can be passed directly with the params parameter, as shown in the following example:

import requests

data = {
    'name': 'germey',
    'age': 25
}
r = requests.get('https://httpbin.org/get', params=data)
print(r.text)

The running results are as follows:

{
  "args": {
    "age": "25",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.10.0"
  },
  "origin": "122.4.215.33",
  "url": "https://httpbin.org/get?age=25&name=germey"
}

Here we pass the URL parameters as a dictionary to the params argument of the get method. From the returned information, we can see that the requested link is automatically constructed as httpbin.org/get?age=25&… , so we don’t have to build the URL ourselves, which is very convenient.

In addition, the web page’s return type is actually str, but it happens to be in JSON format. So, if you want to parse the returned result directly into a dictionary, you can call the json method. The following is an example:

import requests

r = requests.get('https://httpbin.org/get')
print(type(r.text))
print(r.json())
print(type(r.json()))

The running results are as follows:

<class 'str'>
{'headers': {'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.10.0'}, 'url': 'http://httpbin.org/get', 'args': {}, 'origin': '182.33.248.131'}
<class 'dict'>

As you can see, calling the json method converts the JSON-formatted string into a dictionary.

But note that if the returned result is not in JSON format, a parse error occurs and a json.decoder.JSONDecodeError exception is thrown.
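As an extra illustration of my own (not part of the original example), a minimal sketch of guarding against this exception might look like the following; it requests an HTML page, so the json call is expected to fail:

import json
import requests

# This page returns HTML rather than JSON, so calling json() raises an exception.
r = requests.get('https://ssr1.scrape.center/')
try:
    data = r.json()
except json.decoder.JSONDecodeError:
    print('The response is not valid JSON')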

Scraping a page

The above request link returns a JSON-formatted string, so if we request an ordinary web page, we will naturally get the corresponding HTML content. Let’s try it with the example page ssr1.scrape.center/, adding a little extraction logic so the code looks like this:

import requests
import re

r = requests.get('https://ssr1.scrape.center/')
pattern = re.compile('<h2.*?>(.*?)</h2>', re.S)
titles = re.findall(pattern, r.text)
print(titles)

In this example we used the most basic regular expression to match all the movie titles. We’ll cover regular expressions in more detail in the next section; here it just serves as an illustration.

The running results are as follows:

['The Shawshank Redemption', 'Farewell My Concubine', 'Titanic', 'Roman Holiday', 'Léon', 'Waterloo Bridge', 'Flirting Scholar', 'The King of Comedy', 'The Truman Show', 'To Live']

We found that all movie titles were successfully extracted, and a basic fetching and extracting process was completed.

Fetching binary data

In the example above, we’re grabbing a page from a web site, which actually returns an HTML document. What if you want to grab pictures, audio, video, etc.?

Pictures, audio, and video files are essentially composed of binary data; we can view all kinds of media only because of their specific file formats and the corresponding parsing methods. So, to grab them, we have to obtain their binary data.

Here’s an example that grabs the site icon (favicon) of the example site:

import requests

r = requests.get('https://scrape.center/favicon.ico')
print(r.text)
print(r.content)

The content captured here is the site icon, the small icon that appears on each browser tab, as shown below:

Two properties of the Response object are printed here, one is text and the other is content.

The results are shown in the figure: the output of r.text and r.content, respectively.

Notice that the former shows garbled characters, while the latter is prefixed with a b, indicating bytes data. Because the image is binary data, r.text decodes it directly into a string when printing, which naturally produces garbled output.

The result above can’t be read directly, but that doesn’t matter: it is simply the binary data of the image, and we can just save it to a file. The code is as follows:

import requests

r = requests.get('https://scrape.center/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(r.content)

Here we use the open function: the first argument is the file name, and the second argument is 'wb', which opens the file for writing binary data.

After running it, you can see that an icon file named favicon.ico appears in the folder, as shown in the figure.

So, we have successfully saved the binary data as an image, and the little icon has been successfully crawled.

Similarly, audio and video files can be retrieved in this way.
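For larger audio or video files, one optional approach (my own sketch; the URL below is only a placeholder, not a real resource) is to stream the response and write it to disk in chunks, so the whole file never has to sit in memory:

import requests

# Placeholder URL: replace it with a real audio or video address.
url = 'https://example.com/demo.mp4'
with requests.get(url, stream=True) as r:
    with open('demo.mp4', 'wb') as f:
        # iter_content yields the body piece by piece instead of all at once.
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)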

Add headers

We can also set Request Headers for an HTTP request.

This is easy to do using the headers parameter.

In the ssr1.scrape.center example above, we did not set any Request Headers. Without them, some sites may detect that the request does not come from a normal browser and return an abnormal result, causing the scrape to fail.

To add Headers, for example a User-Agent field, we can write:

import requests


headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('https://ssr1.scrape.center/', headers=headers)
print(r.text)

Of course, we can add any other fields to the headers parameter as well.
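For example, here is a small sketch that also sends Referer and Accept-Language fields; the values are chosen purely for illustration:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    # The two fields below are just illustrative extras.
    'Referer': 'https://ssr1.scrape.center/',
    'Accept-Language': 'en-US,en;q=0.9'
}
r = requests.get('https://ssr1.scrape.center/', headers=headers)
print(r.status_code)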

4. POST request

Earlier we looked at the basic GET request, but another common request is POST. Implementing POST requests using Requests is also very simple, as shown in the following example:

import requests

data = {'name': 'germey', 'age': '25'}
r = requests.post("https://httpbin.org/post", data=data)
print(r.text)

Here again the request is httpbin.org/post, and the site can determine if the request is POST and return the relevant request information.

The running results are as follows:

{ "args": {}, "data": "", "files": {}, "form": { "age": "25", "name": "germey" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "18", "Content-Type": "Application/X-www-form-urlencoded ", "Host": "httpbin.org"," user-agent ": "python-requests/2.22.0", "x-amzn-trace-id ": application/x-www-form-urlencoded", "Host": "httpbin.org"," user-agent ": "python-requests/2.22.0", "X-amzn-trace-id ": Root=1-5e6e3b52-0f36782ea980fce53c8c6524"}, "json": null, "origin": "17.20.232.237", "url": "https://httpbin.org/post" }Copy the code

As you can see, we successfully got the return result, where the form part is the submitted data, proving that the POST request was successfully sent.
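As a side note not covered above, if a server expects a JSON body rather than form data, requests can serialize a dictionary for us through the json parameter; a minimal sketch:

import requests

data = {'name': 'germey', 'age': 25}
# The json parameter serializes the dict and sets Content-Type to application/json.
r = requests.post('https://httpbin.org/post', json=data)
print(r.json()['json'])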

5. Response

When you send a request, you get a response. In the example above, we get the content of the response using text and content. In addition, there are many properties and methods that can be used to retrieve additional information, such as status codes, response headers, cookies, and so on. The following is an example:

import requests

r = requests.get('https://ssr1.scrape.center/')
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)

Here we print status_code to get the status code, headers to get the response headers, cookies to get the Cookies, url to get the URL, and history to get the request history.

The running results are as follows:

<class 'int'> 200
<class 'requests.structures.CaseInsensitiveDict'> {'Server': 'nginx/1.17.8', 'Date': 'Sat, 30 May 2020 16:56:40 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'X-Frame-Options': 'DENY', 'X-Content-Type-Options': 'nosniff', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains', 'Content-Encoding': 'gzip'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[]>
<class 'str'> https://ssr1.scrape.center/
<class 'list'> []

As you can see, the headers and cookies properties yield results of type CaseInsensitiveDict and RequestsCookieJar, respectively.

As we learned in the first chapter, the status code represents the response status; for example, 200 means the response is OK, which is also what the example above prints. So we can tell whether the crawl succeeded by checking the status code of the Response.

requests also provides a built-in status code lookup object called requests.codes, which can be used like this:

import requests

r = requests.get('https://ssr1.scrape.center/')
exit() if not r.status_code == requests.codes.ok else print('Request Successfully')

Here, we compare the returned status code with the built-in success code: if the request got a normal response, we print a success message; otherwise the program exits. requests.codes.ok gives us the success status code 200.

That way, we don’t have to hard-code numbers in the program, and using names to represent status codes is more intuitive.

Of course, ok is not the only condition available.

The status codes and their corresponding query names are listed below:

# Informational status codes
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),

# Success status codes
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection status codes
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect', 'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0

# Client error status codes
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server error status codes
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')

For example, if you want to check whether the result is 404, you can use requests.codes.not_found.
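As a small addition of my own, here is a sketch that combines requests.codes.not_found with the related raise_for_status method, using httpbin's status endpoint to produce a 404 on purpose:

import requests

# httpbin.org/status/404 deliberately returns a 404 response.
r = requests.get('https://httpbin.org/status/404')
print(r.status_code == requests.codes.not_found)
try:
    # raise_for_status() raises requests.exceptions.HTTPError for 4xx/5xx
    # responses and does nothing for successful ones.
    r.raise_for_status()
except requests.exceptions.HTTPError as e:
    print('Request failed:', e)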

6. Advanced usage

Earlier we looked at the basic usage of requests, such as simple GET and POST requests and the Response object. Now let’s look at some advanced usage, such as file uploads, Cookie settings, proxy settings, and more.

File upload

We know that requests can simulate submitting form data. If a website requires uploading files, we can do that with requests too; it’s very simple, as shown in the following example:

import requests

files = {'file': open('favicon.ico', 'rb')}
r = requests.post('https://httpbin.org/post', files=files)
print(r.text)

In the previous section we saved a file, favicon.ico, which we use this time to simulate the file upload process. It is important to note that favicon.ico needs to be in the same directory as the current script. If there are other files, of course, you can use other files to upload, just change the code.

The running results are as follows:

{
  "args": {},
  "data": "",
  "files": {
    "file": "data:application/octet-stream;base64,AAABAAI..."
  },
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "6665",
    "Content-Type": "multipart/form-data; boundary=41fc691282cc894f8f06adabb24f05fb",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e6e3c0b-45b07bdd3a922e364793ef48"
  },
  "json": null,
  "origin": "16.20.232.237",
  "url": "https://httpbin.org/post"
}

The site returns a response in which the files field contains our upload and the form field is empty, indicating that file uploads are identified by a separate files field.
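As an extra sketch (not in the original text), the value in the files dictionary can also be a tuple of (filename, file object, content type) if you want to control the file name and MIME type that get sent:

import requests

# (filename, file object, content type) -- the names here are illustrative.
files = {'file': ('favicon.ico', open('favicon.ico', 'rb'), 'image/x-icon')}
r = requests.post('https://httpbin.org/post', files=files)
print(r.status_code)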

Setting Cookies

Earlier we used urllib to handle Cookies, which was a bit complicated to write; with requests, getting and setting Cookies takes just one step.

Let’s take a look at the Cookie retrieval process with an example:

import requests

r = requests.get('https://www.baidu.com')
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

The running results are as follows:

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315

Here we first call the cookies attribute to get the Cookies, which are of type RequestsCookieJar. Then we use the items method to convert them into a list of tuples and iterate over the name and value of each Cookie entry.
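As a small aside of my own, RequestsCookieJar also provides a get_dict method that turns the Cookies into an ordinary dictionary in one call; a minimal sketch:

import requests

r = requests.get('https://www.baidu.com')
# get_dict() returns the cookie names and values as a plain dict.
print(r.cookies.get_dict())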

Of course, we can also use Cookies to maintain a login state. For example, log in to GitHub and copy the Cookie from the Request Headers, as shown in the figure.

You can replace this with your own Cookie, set it to Headers, and send the request as shown in the following example:

import requests

headers = {
    'Cookie': '_octo=GH1.1.1849343058.1576602081; _ga=GA1.2.90460451.1576602111; __Host-user_session_same_site=nbDv62kHNjp4N5KyQNYZ208waeqsmNgxFnFC88rnV7gTYQw_; _device_id=a7ca73be0e8f1a81d1e2ebb5349f9075; user_session=nbDv62kHNjp4N5KyQNYZ208waeqsmNgxFnFC88rnV7gTYQw_; logged_in=yes; dotcom_user=Germey; tz=Asia%2FShanghai; has_recent_activity=1; _gat=1; _gh_sess=your_session_info',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
}
r = requests.get('https://github.com/', headers=headers)
print(r.text)

We find that the result contains content that is only visible after logging in, as shown in the figure below:

You can see that my GitHub username is included here, and you can also get yours if you try.

If you get a result like this, you have successfully simulated the login state with Cookies, and you can crawl pages that are only visible after logging in.

Alternatively, we can construct a RequestsCookieJar object, set the Cookies on it, and pass it in through the cookies parameter, as shown in the following example:

import requests

cookies = '_octo=GH1.1.1849343058.1576602081; _ga=GA1.2.90460451.1576602111; __Host-user_session_same_site=nbDv62kHNjp4N5KyQNYZ208waeqsmNgxFnFC88rnV7gTYQw_; _device_id=a7ca73be0e8f1a81d1e2ebb5349f9075; user_session=nbDv62kHNjp4N5KyQNYZ208waeqsmNgxFnFC88rnV7gTYQw_; logged_in=yes; dotcom_user=Germey; tz=Asia%2FShanghai; has_recent_activity=1; _gat=1; _gh_sess=your_session_info'
jar = requests.cookies.RequestsCookieJar()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}
for cookie in cookies.split('; '):
    key, value = cookie.split('=', 1)
    jar.set(key, value)
r = requests.get('https://github.com/', cookies=jar, headers=headers)
print(r.text)

Here we first create a RequestsCookieJar object, split the copied cookies string with the split method, and set each Cookie’s key and value with the set method. Then we call the get method of requests and pass in the cookies parameter.

After testing, we find that this also accesses the logged-in page normally.

Session maintenance

In requests, you can simulate web requests directly with get or post, but each such call is effectively a separate session, as if you opened two pages in two different browsers.

Consider a scenario where the first request uses the post method to log in to a site, and the second request uses the get method to fetch the personal information page after the successful login.

In fact, this is equivalent to opening two browsers: two completely independent operations corresponding to two unrelated sessions. Can the second request successfully get the personal information? Of course not.

You might say: why not just set the same Cookies on both requests? That works, but it’s cumbersome, and there is a simpler solution.

The main way to solve this is to maintain the same session, which is like opening a new tab in the same browser instead of opening a new browser. But we don’t want to set Cookies manually every time, so what do we do? This is where a new tool comes in: the Session object.

With it, we can easily maintain a session without worrying about Cookies; it handles them for us automatically.

Let’s do a little experiment first. If we use the previous method, the example is as follows:

import requests

requests.get('https://httpbin.org/cookies/set/number/123456789')
r = requests.get('https://httpbin.org/cookies')
print(r.text)

Here we first request the test URL httpbin.org/cookies/set… ; requesting it sets a Cookie entry named number with value 123456789. We then request httpbin.org/cookies, which returns the current Cookie information.

Can I successfully get the Cookie I set? Give it a try.

The running results are as follows:

{
  "cookies": {}
}

That doesn’t work.

Let’s try using Session again:

import requests

s = requests.Session()
s.get('https://httpbin.org/cookies/set/number/123456789')
r = s.get('https://httpbin.org/cookies')
print(r.text)

Take a look at the results:

{
  "cookies": {
    "number": "123456789"
  }
}

This time the Cookies were successfully obtained! Now you can see the difference between using the same session and using different sessions.

So with Session, you can simulate the same Session without worrying about cookies. It is usually used to simulate a successful login before proceeding to the next step.

Sessions are commonly used to simulate opening different pages of the same site in a browser, and there will be a section on this later.
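A small additional sketch of my own: a Session can also carry default headers that are merged into every request it sends, so fields like User-Agent only need to be set once (the value below is purely illustrative):

import requests

s = requests.Session()
# Headers set on the session are sent with every request made through it.
s.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; demo)'})
s.get('https://httpbin.org/cookies/set/number/123456789')
r = s.get('https://httpbin.org/cookies')
print(r.text)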

SSL Certificate Verification

Many websites now require HTTPS, but some sites have not set up an HTTPS certificate, or their certificate is not recognized by a CA. When you visit such a site, the browser may show an SSL certificate error.

For example, if we open the example site ssr2.scrape.center/ with Chrome, we get an error like “Your connection is not private”, as shown in the picture below:

The browser has settings that let us ignore the certificate error and continue.

But what happens if we request such a site with requests? Let’s try it in code:

import requests

response = requests.get('https://ssr2.scrape.center/')
print(response.status_code)

The running results are as follows:

requests.exceptions.SSLError: HTTPSConnectionPool(host='ssr2.scrape.center', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)')))

As you can see, SSLError is thrown directly because the certificate for the URL we requested is invalid.

So what if we have to crawl the site? We can use the verify parameter to control whether the certificate is verified, and if set to False, the certificate is no longer validated on request. If the verify parameter is not added, the default value is True and the verification is automatically performed.

We rewrite the code as follows:

import requests

response = requests.get('https://ssr2.scrape.center/', verify=False)
print(response.status_code)

This prints out the status code for the successful request:

/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py:857: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
200

However, a warning is printed suggesting that we add certificate verification. We can suppress this warning by setting it to be ignored:

import requests
from requests.packages import urllib3

urllib3.disable_warnings()
response = requests.get('https://ssr2.scrape.center/', verify=False)
print(response.status_code)

Or ignore warnings by capturing warnings to logs:

import logging
import requests

logging.captureWarnings(True)
response = requests.get('https://ssr2.scrape.center/', verify=False)
print(response.status_code)

Of course, we can also specify a local certificate as the client certificate, which can be a single file (containing the key and certificate) or a tuple containing two file paths:

import requests

response = requests.get('https://ssr2.scrape.center/', cert=('/path/server.crt', '/path/server.key'))
print(response.status_code)

Of course, the above code is just a demonstration; you need to have the crt and key files and specify their correct paths. Note that the private key of the local certificate must be decrypted; encrypted keys are not supported.

timeout

When the local network is poor, or the server responds slowly or not at all, we may wait a very long time for a response, or never receive one. To avoid waiting indefinitely, we can set a timeout: if no response arrives within that period, an error is raised. This is done with the timeout parameter, which measures the time from sending the request to receiving the response. The following is an example:

import requests

r = requests.get('https://httpbin.org/get', timeout=1)
print(r.status_code)

In this way, we can set the timeout to one second and throw an exception if there is no response within one second.

In fact, the request is divided into two phases, namely connect and read.

The single timeout value set above is applied to both the connect phase and the read phase.

If you specify it separately, you can pass in a tuple:

r = requests.get('https://httpbin.org/get', timeout=(5, 30))

If you want to wait indefinitely, you can set timeout to None or simply omit it, since the default is None. That way, even if the server responds extremely slowly, requests will wait patiently and never raise a timeout error. Its usage is as follows:

r = requests.get('https://httpbin.org/get', timeout=None)

Or directly without parameters:

r = requests.get('https://httpbin.org/get')
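As an extra illustration of my own, if you want to handle the timeout instead of letting the program crash, you can catch requests.exceptions.Timeout; the timeout value below is deliberately tiny so that the request is likely to fail:

import requests

try:
    r = requests.get('https://httpbin.org/get', timeout=0.01)
    print(r.status_code)
except requests.exceptions.Timeout:
    print('The request timed out')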

Identity authentication

In the previous section we saw that when visiting a website with basic authentication enabled, we first encounter an authentication prompt, for example ssr3.scrape.center/, as shown in the figure below.

This site has basic authentication enabled. In the previous section, we implemented authentication with urllib, but it was relatively cumbersome. So how do we do it in requests? There is, of course, a way.

We can use the authentication feature built into requests via the auth parameter, as shown in the following example:

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('https://ssr3.scrape.center/', auth=HTTPBasicAuth('admin', 'admin'))
print(r.status_code)

The user name and password for this example site are both admin, which we can set directly here.

If the user name and password are correct, the request will be automatically authenticated and the 200 status code will be returned. If the authentication fails, the 401 status code is returned.

Of course, constructing an HTTPBasicAuth object is a bit verbose, so requests provides a simpler way: pass a tuple directly, and it will use the HTTPBasicAuth class by default.

So the above code can be simply shortened as follows:

import requests

r = requests.get('https://ssr3.scrape.center/', auth=('admin', 'admin'))
print(r.status_code)

In addition, requests supports other authentication methods, such as OAuth. Using it requires installing the requests_oauthlib package with the following command:

pip3 install requests_oauthlib

An example method of using OAuth1 authentication is as follows:

import requests
from requests_oauthlib import OAuth1

url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET', 'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
requests.get(url, auth=auth)

For more detailed features, refer to the official requests_oauthlib documentation: requests-oauthlib.readthedocs.org/; we won’t go into them here.

Proxy settings

For some sites, a few requests during testing work fine. However, once large-scale, frequent crawling starts, the site may show a CAPTCHA, redirect to a login page, or even block the client’s IP address, making it inaccessible for a period of time.

To prevent this, we need to set up proxies, which is done with the proxies parameter. It can be set like this:

import requests

proxies = {
  'http': 'http://10.10.10.10:1080',
  'https': 'http://10.10.10.10:1080',
}
requests.get('https://httpbin.org/get', proxies=proxies)

Of course, running this example directly may not work, because the proxy address may be invalid. You can find a valid proxy and try it out yourself.

If the proxy requires HTTP Basic Auth as described above, you can use a syntax like http://user:password@host:port, as shown in the following example:

import requests

proxies = {'https': 'http://user:password@10.10.10.10:1080/', }
requests.get('https://httpbin.org/get', proxies=proxies)

In addition to basic HTTP proxies, requests also supports proxies using the SOCKS protocol.

First, you need to install the SOCKS library:

pip3 install "requests[socks]"

You can then use the SOCKS protocol proxy as shown in the following example:

import requests

proxies = {
    'http': 'socks5://user:password@host:port',
    'https': 'socks5://user:password@host:port'
}
requests.get('https://httpbin.org/get', proxies=proxies)
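As a side note of my own, proxies can also be set once on a Session so that every request made through it uses them; the addresses below are the same placeholders as above and may not actually work:

import requests

s = requests.Session()
# Every request sent through this session now goes through the proxies below.
s.proxies.update({
    'http': 'http://10.10.10.10:1080',
    'https': 'http://10.10.10.10:1080',
})
s.get('https://httpbin.org/get')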

Prepared Request

We can certainly send requests directly using the Requests library’s GET and POST methods, but have you ever wondered how that request is implemented inside Requests?

Internally, requests constructs a Request object, assigns it various parameters such as the URL, headers, and data, and then sends it. After the request succeeds, a Response object is obtained and parsed.

So what type is this internal request? It is actually a Prepared Request.

Let’s try using a Prepared Request object directly, as shown in the following example:

from requests import Request, Session

url = 'https://httpbin.org/post'
data = {'name': 'germey'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}
s = Session()
req = Request('POST', url, data=data, headers=headers)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)

Here we import the Request class and construct a Request object with the url, data, and headers parameters. We then call the Session’s prepare_request method to turn it into a Prepared Request object, and finally call send to send it. The running results are as follows:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5e5bd6a9-6513c838f35b06a0751606d8"
  },
  "json": null,
  "origin": "167.220.232.237",
  "url": "http://httpbin.org/post"
}

As you can see, we have achieved the same POST request effect.

With the Request class, a request can be treated as an independent object, so in some scenarios (such as queue scheduling) we can manipulate Request objects directly, which makes request scheduling and other operations more flexible.
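For example, here is a small sketch of my own showing that the prepared request is an ordinary object that can be inspected or tweaked before sending; the X-Demo header is purely illustrative:

from requests import Request, Session

s = Session()
req = Request('POST', 'https://httpbin.org/post', data={'name': 'germey'})
prepped = s.prepare_request(req)
# Modify the prepared request just before sending it.
prepped.headers['X-Demo'] = 'prepared'
r = s.send(prepped, timeout=5)
print(r.status_code)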

For more information, see the official documentation for requests: docs.python-requests.org/.

7. Summary

That concludes the basic usage of the requests library. Doesn’t it feel more convenient than urllib? Later, we will use requests to crawl a real site and consolidate what we have learned.

This section code: github.com/Python3WebS…

Thank you very much for reading. For more content, please follow my official accounts “Attack Coder” and “Cui Qingcai | Jingmi”.