Crawler network request module

Preface

Anyone who has written a crawler knows that crawling boils down to three steps: first, identify the site we want to crawl; second, send a request to the site and analyze its data structure; third, extract the data and save it. That is basically all a crawler needs to do. The second step, sending a request to the target site, is where Python has an unrivalled advantage: it offers both the built-in urllib network request module and the well-known third-party Requests network request module, and everyone who has used them speaks well of them. Today we will study the knowledge related to these network request modules.

Python’s built-in module urllib

In Python 2, making a request to a website was cumbersome: you had to import both urllib and urllib2. Python 3 integrates them, so we can use urllib directly.

urllib.request

  • Common methods
    • urllib.request.urlopen('url') sends a request to a website and gets the response
    • byte stream = response.read()
    • string = response.read().decode('utf-8')
    • urllib.request.Request('url', headers=dictionary) is needed because urlopen() does not support adding a User-Agent; Request() lets us rebuild the request with one
import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
# read() reads the contents of the response object
print(response.read())

The execution result

As you can see from the result, what we get is a byte stream, i.e. bytes data. At this point we can decode() the bytes data into str data.

html = response.read().decode('utf-8')
print(type(html),html)

The execution result

In this way we can simply send a request to a website and get its response data. Now let's think about anti-crawling. The most basic anti-crawling measure is checking the User-Agent, but the urlopen() method does not support setting one. So let's look at another method, urllib.request.Request(), which lets us add a User-Agent.

import urllib.request

url = 'https://www.baidu.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
# 1. Build the request object carrying the User-Agent
req = urllib.request.Request(url, headers=headers)
# 2. Get the response object with urlopen()
res = urllib.request.urlopen(req)
# 3. Read and decode the content
html = res.read().decode('utf-8')
print(html)

Now we can summarize the usage

1. Build the Request object using the Request() method

2. Use urlopen() to get the response object

3. Use read().decode('utf-8') on the response object to retrieve the content

  • Methods of the response object
    • read() reads the content of the server response
    • getcode() returns the HTTP response code
    • geturl() returns the URL of the actual data (useful when the request was redirected)
print(res.getcode())
print(res.geturl())

The execution result

urllib.parse

  • Commonly used methods
    • urlencode(dictionary)
    • quote(string)

What do these two methods do? We know that a URL cannot actually carry raw Chinese characters. Although Chinese is displayed in the address bar, it is the browser that handles the encoding behind the scenes. For example, copy such a URL out of the browser and paste it into PyCharm, and you will see the encoded form.

So what is going on here? When we request a URL in the browser, the browser encodes the URL: apart from English letters, digits and a few symbols, everything else is replaced with a percent sign followed by hexadecimal codes (percent-encoding). So when writing code, we sometimes need to encode the Chinese characters manually.

import urllib.parse

name = {'wd': '中国'}
name = urllib.parse.urlencode(name)
print(name)

r = '中国'
result = urllib.parse.quote(r)
print(result)
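
To see how these encoded strings are actually used, here is a minimal sketch that appends an urlencoded query string to a base URL and requests it with urllib. The search endpoint and the 'wd' parameter name here are assumptions for illustration, not taken from the original examples.

import urllib.request
import urllib.parse

# assumed search endpoint and parameter name, for illustration only
base_url = 'https://www.baidu.com/s?'
params = urllib.parse.urlencode({'wd': '中国'})
full_url = base_url + params  # e.g. https://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(full_url, headers=headers)
res = urllib.request.urlopen(req)
print(res.getcode(), res.geturl())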

Third-party network request module: Requests

Before we start: anyone who has looked into crawlers will be familiar with the famous Requests module. As its own introduction says, it is "HTTP for Humans". Without further ado, let's start learning it.

Installation

Since Requests is a third-party module, we need to install it. It is recommended to install it from a domestic mirror source; here we use the Douban mirror as an example.

pip install requests -i https://pypi.douban.com/simple

Installing from a mirror source is very fast. Now that the preparation is done, the rest is much easier.
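
If you want to double-check that the installation succeeded, a quick sanity check (just a sketch, not a required step) is to import the module and print its version:

import requests

# if this import succeeds, the module is installed
print(requests.__version__)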

Common Requests methods

  • requests.get(url)
import requests

response = requests.get('https://www.baidu.com/')
print(response)

The execution result

So we have made a simple request to Baidu and got back a Response object, which shows us a status code; 200 means the request was successful. A GET request does not have to carry only the URL: we can also add parameters, for example the headers request header.

import requests

# add a request header carrying a User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
response = requests.get('https://www.baidu.com/', headers=headers)
print(response)

For example, suppose I want to pass the Chinese word 中国 in the URL. With urllib we would reach for urlencode, but in the Requests module it is not so complicated, as shown below.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
# pass the query parameters as a dictionary; Requests encodes them for us
kw = {'kw': '中国'}
response = requests.get('https://www.baidu.com/', params=kw, headers=headers)
print(response.text)

The execution result

Response method

Suppose we make a request to a website and want to get the page data. What do we do? This is not difficult, as the Requests module handles it for us.

# response.text returns str data
print(response.text)
# response.content returns bytes data
print(response.content)
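
Besides text and content, the Response object exposes a few other attributes that come in handy; a small sketch of the most common ones:

import requests

response = requests.get('https://www.baidu.com/')
print(response.status_code)  # HTTP status code, e.g. 200
print(response.url)          # the URL that was actually requested
print(response.headers)      # the response headers returned by the server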

Although response.text can obtain the page data, it often produces garbled Chinese characters. In that case we use another way to get the page data: response.content.decode('utf-8'). Getting the page source this way solves the garbled-Chinese problem nicely.

response.content returns byte-stream data. So how does response.text give us str data?

The first reaction is that one prints a string and the other prints a byte stream. To be precise, response.content is the data fetched from the website as-is, without any processing, that is, without any decoding. response.text is the string the Requests module produces by decoding response.content. You might think decoding requires a specified encoding: we did not specify one here, but the Requests library guesses an encoding for us. When that guess is wrong, response.text shows garbled characters, which is why we decode ourselves with response.content.decode('utf-8').
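
If you would rather keep using response.text, one common workaround (a sketch of an alternative, not the original code above) is to tell Requests which encoding to use before reading text; apparent_encoding is the encoding Requests guesses from the response body itself:

import requests

response = requests.get('https://www.baidu.com/')
print(response.encoding)  # the encoding Requests picked from the response headers
response.encoding = 'utf-8'  # or: response.encoding = response.apparent_encoding
print(response.text)  # now decoded with the encoding we set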

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
kw = {'kw': '中国'}
response = requests.get('https://www.baidu.com/', params=kw, headers=headers)
# decode the byte stream ourselves to avoid garbled Chinese
print(response.content.decode('utf-8'))

Sending POST requests with Requests

So far we have only sent GET requests with Requests. Now let's use the Youdao Dictionary as a case study to send a POST request.

A POST request needs to carry data, which is typically form data.

This is the data in the form. It is not shown in the URL. Let's build a little translation tool by sending a POST request through the Requests module that carries this data directly, with the code below.

import requests
import json

# the word to translate
key = input('Please enter the word to translate: ')
data = {
    'i': key,
    'from': 'AUTO',
    'smartresult': 'dict',
    'client': 'fanyideskweb',
    'salt': '15880623642174',
    'sign': 'c6c2e897040e6cbde00cd04589e71d4e',
    'ts': '1588062364217',
    'bv': '42160534cfa82a6884077598362bbc9d',
    'doctype': 'json',
    'version': '2.1',
    'keyfrom': 'fanyi.web',
    'action': 'FY_BY_CLICKBUTTION'
}
url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
# send the POST request carrying the form data
res = requests.post(url, data=data, headers=headers)
res.encoding = 'utf-8'
# fetch the data
html = res.text
# convert the JSON string to a dictionary
r_dict = json.loads(html)
# parse out the translation
content = r_dict['translateResult'][0][0]['tgt']
print(content)

The execution result

Cookie

Cookies are very important data; for us, their usefulness is mainly reflected in two aspects:

  • Simulating login
  • Some sites use cookies as an anti-crawling measure

So what exactly is a Cookie?

The following is quoted from the web:

A cookie (sometimes used in its plural form, cookies) is data (usually encrypted) that a website stores on the user's local terminal in order to identify the user and track sessions. Simply put, it reads and saves some of the behaviour information generated when you visit a site. A common example is when a page asks whether we want to save the username and password so that we can be logged in automatically next time without signing in again.

Since the advent of cookie technology, it has been a focus of debate between web users and web developers. Some network users, even some experienced web experts, are unhappy with its creation and spread. This is not because cookies are technically weak or perform poorly, but because their use can harm users' privacy: a cookie is a small text file that the web server stores in the user's browser, and it contains information about the user.

Now that we know about cookies, let's look at their first use: simulating login. For example, Zhihu is currently in a logged-in state, with the hot-list column selected.

We can find the cookie through the Network panel of the browser's developer tools, and then perform a simulated login using that cookie information.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'cookie': '_xsrf=1SMkDEbBof93pTCRd5MmPz8cmmOuAsaU; _zap=3a8fd847-c5d4-45cf-84a3-24d508f580f6; _ga=GA1.2.2058930090.1594280819; d_c0="AICeuVa2jBGPTuvzpsC3VFkq3TulCqxCfNQ=|1594280816"; z_c0="2|1:0|10:1594901209|4:z_c0|92:Mi4xRjdYeENBQUFBQUFBZ0o2NVZyYU1FU1lBQUFCZ0FsVk4yWkQ5WHdBbzV5TkZwYUs4a0RpNWdRUms2Yy1OQlRkaER3|3e67794db7e5f5ec768144d12fdac5ddf9be6d575cf0da3081bd59c5fd132558"; tshl=; tst=h; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1600419464,1600765648,1600867068,1601108280; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1601108280; _gid=GA1.2.809449579.1601108280; KLBRSID=9d75f80756f65c61b0a50d80b4ca9b13|1601108281|1601108278; SESSIONID=sP67fUKhcmoakcsAU5RzS0NBNUzVG9ocD2JR2F5BsgF; JOID=VF4SAULnQMqzo0LPa-OP2bVp1097lBH5j8oXhQaLLaTXwy68K9-9quisQ8tlyVRuZgOhBpkxYtdJhmXXDe_IHYo=; osd=V1wTC0_kQsu5rkHNaumC2rdo3UJ4lhDzgskVhAyGLqbWySO_Kd63p-uuQsFoylZvbA6iBJg7b9RLh2_aDu3JF4c='
}
url = 'https://www.zhihu.com/hot'
res = requests.get(url, headers=headers)
with open('zhihu.html', 'w', encoding='utf-8') as f:
    f.write(res.text)

This HTML file can be opened in a browser

Now let's take a look at the second use of cookies: anti-crawling.

The picture below is a list of trains from Beijing to Shanghai. This list of trips is data obtained through an Ajax request.

So how do we get this train list with a crawler? The first approach is Selenium, a technology we will cover in a later article. The second approach is to analyse the site's data interface.

After analysis, we find the data in the result field of the response.

We can simply request the URL of the data interface

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
url = 'https://kyfw.12306.cn/otn/leftTicket/query?leftTicketDTO.train_date=2020-11-27&leftTicketDTO.from_station=BJP&leftTicketDTO.to_station=SHH&purpose_codes=ADULT'
res = requests.get(url, headers=headers)
print(res.content.decode('utf-8'))

The execution result

This is clearly not what we want. Let’s try cookies at this point

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'cookie': '_uab_collina=159490169403897938828076; JSESSIONID=9CCC55A5791112A1D991D16D05B8DE6C; _jc_save_wfdc_flag=dc; _jc_save_fromStation=%u5317%u4EAC%2CBJP; BIGipServerotn=1206911498.38945.0000; BIGipServerpool_passport=149160458.50215.0000; RAIL_EXPIRATION=1606609861531; RAIL_DEVICEID=Q2D75qw5BZafd0LCbLz0B0CWC8cdKlDp8taGuqQjNvLGk3cYKCg1Y4KoXbWHpTmr6iY988VhF0wHULKY9RimC4dWVelVHcf94Q3FRxQ0LfbzRqvTvC19gq7XNKs0aQgeBhCZ5dVfllX8gW5GHSoeQ10di_JL7sLg; route=6f50b51faa11b987e576cdb301e545c4; _jc_save_toStation=%u4E0A%u6D77%2CSHH; _jc_save_fromDate=2020-11-27; _jc_save_toDate=2020-11-25'
}
url = 'https://kyfw.12306.cn/otn/leftTicket/query?leftTicketDTO.train_date=2020-11-27&leftTicketDTO.from_station=BJP&leftTicketDTO.to_station=SHH&purpose_codes=ADULT'
res = requests.get(url, headers=headers)
print(res.content.decode('utf-8'))

The execution result

Session

A session identifies the user through information recorded on the server side. The session here is not quite the same thing as the session we usually talk about on the web. To be clear, why do we want session persistence? The Session object of the Requests library can persist certain parameters across requests. In plain English, once you have successfully logged in to a site using a Session object, subsequent requests made with the same object will carry that site's cookies and other parameters by default.

This is most useful when we need to stay logged in. Some websites or apps force a login, or return fake or incomplete data to visitors who are not logged in. We cannot afford to log in again for every single request; with a session that keeps its state, we only need to log in once and can then keep doing everything else in that logged-in state.
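
As a minimal sketch of this idea (the login URL and form fields below are hypothetical placeholders, not a real site's interface), a requests.Session object carries the cookies obtained at login into every later request:

import requests

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0'}

# hypothetical login endpoint and form fields; replace with the real site's values
login_url = 'https://example.com/login'
login_data = {'username': 'user', 'password': 'pass'}
session.post(login_url, data=login_data, headers=headers)

# later requests made through the same Session automatically carry the
# cookies set during login, so we stay logged in
res = session.get('https://example.com/profile', headers=headers)
print(res.status_code)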