This is the 8th day of my participation in the August More Text Challenge

In the last article we looked at using Python’s built-in urllib module to crawl our first page. In practice, though, urllib is not what most Internet companies rely on today. In this article we introduce Requests, one of the most popular and fully featured libraries for writing web crawlers.

Why study Requests

Before we answer that question, let’s talk about Requests:

Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, powered by urllib3, which is embedded within Requests.

This excerpt from the official Requests documentation may read like a boast, but in practice Requests really does cut down our development and configuration effort significantly. The 32k stars on its GitHub page also speak to its success and good pedigree.
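As a quick illustration of what that quote means in practice, here is a minimal sketch (using the public httpbin.org test service rather than any site from this article) of Requests building a query string and form-encoding POST data for us:

    import requests

    # Requests appends the query string for you: .../get?keyword=python&page=1
    r = requests.get("https://httpbin.org/get", params={"keyword": "python", "page": 1})
    print(r.url)

    # Requests form-encodes the POST body for you; no manual urlencode needed
    r = requests.post("https://httpbin.org/post", data={"user": "demo", "lang": "zh-CN"})
    print(r.status_code)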

Why study Requests? For starters, the main reasons are:

  • Requests has a wealth of learning resources on the Internet. A search for “Requests crawler” on Baidu yields more than 160,000 results. This means the technology around Requests is mature, and for beginners in particular, abundant learning material means fewer pitfalls to stumble into while learning;
  • Requests has official documentation in Chinese. This is the best resource for newcomers, especially those who are not comfortable with English. The official documentation provides detailed and very accurate function definitions and instructions. When problems come up during development and Baidu, Google, Stack Overflow and every other search have all failed to solve them, going back to the official documentation is the safest and fastest route to an answer.

Getting started with Requests

  1. Install Requests

    Because Requests is a third-party library, we need to install it ourselves. Enter the following in the CMD console:

    pip install requests

    When the console reports that the installation succeeded, we can open the Python interpreter and import requests to verify that the installation worked (see the sketch below).

(Please forgive the Linux screenshots; as I write this part, my Windows PC has “tragically” broken down.)
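A minimal sketch of that verification step (assuming a standard CPython install; the exact version number will differ on your machine):

    # Run inside the Python interactive interpreter (type `python` at the console)
    import requests

    # If the import raises no ImportError, the installation worked;
    # printing the bundled version string is an extra sanity check.
    print(requests.__version__)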

  2. Rewrite the urllib page-access code

    Crawling a web page with Requests takes only a few lines of code and is far simpler than with urllib.

    import requests
    url = "http://juejin.cn/"
    web_data = requests.get(url)
    web_info = web_data.text
    print(web_info)

    Let’s run this little program and print the result:

    …<p>Juejin community</p>…

    Look at that! Requests automatically detected the encoding for us and displayed the Chinese text properly!

    Let’s walk through this code in detail

    import requests
    url = "http://juejin.cn/"
    web_data = requests.get(url)

    The code above is easy to understand: the first line imports the Requests library, the second defines the URL we want to crawl, and the third calls requests.get() directly to fetch the page.

    web_info = web_data.text

    When we make a GET request, Requests makes an educated guess about the encoding based on the HTTP headers, so when we access web_data.text, Requests decodes the response body with that inferred encoding.
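    If the guess is ever wrong and Chinese text comes out garbled, the encoding can be inspected and overridden by hand. A minimal sketch, reusing the same Juejin URL:

    import requests

    web_data = requests.get("http://juejin.cn/")

    # Encoding guessed from the HTTP headers (may fall back to ISO-8859-1)
    print(web_data.encoding)

    # Encoding detected by analysing the response body itself
    print(web_data.apparent_encoding)

    # Override the guess before reading .text if the output looks garbled
    web_data.encoding = web_data.apparent_encoding
    print(web_data.text[:200])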

  3. Customize request headers

    What is a request header? HTTP request headers are the fields an HTTP client (such as a browser) sends to the server along with a request (usually GET or POST). The client can also choose to send additional headers when necessary.

    Remember the “mock browser” behavior we mentioned in the last article? Yes, the browser identification (the User-Agent) is also carried in the request header.

The figure above shows a typical set of request headers. In Requests, we can easily construct our own request headers:

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
r = requests.get("http://gitbook.cn/", headers=headers)
  4. The usefulness of Cookies

    HTTP, the protocol used to transfer data across the Internet, is stateless, which means the client and server disconnect once a data transfer completes. We therefore need a mechanism to keep the session going. Cookies have played this role since before server-side sessions existed: the small amount of information in a cookie lets us track a session, and it generally records the user’s identity.
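    As a side note, Requests exposes this mechanism directly: each response carries the cookies the server set, and a requests.Session object remembers them across requests. A minimal sketch, separate from the CSDN walkthrough below:

    import requests

    # Cookies the server set on a single response
    r = requests.get("https://my.csdn.net/")
    print(r.cookies)  # a RequestsCookieJar

    # A Session keeps cookies (and other settings) across requests automatically
    s = requests.Session()
    s.get("https://my.csdn.net/")       # cookies set here...
    r2 = s.get("https://my.csdn.net/")  # ...are sent back automatically here
    print(s.cookies)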

    What is a Cookie? Simply put, it is a piece of data that records things like your username and password so you can go straight into your account space. Talk is cheap, though; let’s try it ourselves.

    This time we will try to access CSDN. First of all, this is the personal page displayed after I have logged in.

Before adding cookies, let’s try to access this page.

import requests
url = "https://my.csdn.net/"
web_data = requests.get(url)
web_info = web_data.text
print(web_info)

The running results are as follows:

The result is a page asking us to either log in or register.

So, what about using a Cookie? First we need to get our own Cookie. If you’re using Chrome, right-click the page, choose Inspect, switch to the Network tab, and refresh the page; you can then see the corresponding Cookie in the request headers.

Attention! Cookie data is highly private personal data! If someone else obtains it, they can log in to your account by fairly ordinary means. So please do not casually show your Cookie information to others!

Let’s revise the code again

import requests
url = 'https://my.csdn.net/'
header = {
    'Cookie': 'Personal Cookie hidden here',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
web_data = requests.get(url, headers=header)
web_info = web_data.text
print(web_info)

Run it and see the results

We can see that the crawled result already contains the articles I had bookmarked while logged in. The Cookie worked!

Summary

  • Using Requests removes much of the complicated development and configuration work and lets us focus on the crawling technology itself.

  • Request headers can be customized directly; you can refer to this article to understand the details of request headers and request bodies;

  • Cookies are very sensitive private data. With a Cookie, one can crawl the information of the associated account, so do not casually show your Cookie to others.