The Requests library is by far the most important and most commonly used Python crawler library, so be sure to master it.

1. Getting to know the library

import requests

url = 'http://www.baidu.com'
r = requests.get(url)
print type(r)
print r.status_code
print r.encoding
#print r.text
print r.cookies

Got:
<class 'requests.models.Response'>
200
ISO-8859-1
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
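
Notice that r.encoding comes back as ISO-8859-1 here, which is why Chinese pages often look garbled when read through r.text. A minimal sketch of one common workaround, not shown in the original article, using the apparent_encoding attribute that Requests derives from the response body:

import requests

url = 'http://www.baidu.com'
r = requests.get(url)

print r.encoding                  # ISO-8859-1, guessed from the headers
r.encoding = r.apparent_encoding  # switch to the encoding detected from the body
print r.encoding                  # e.g. utf-8
#print r.text                     # now decodes correctly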

2. GET requests

values = {'user':'aaa','id':'123'}
url = 'http://www.baidu.com'
r = requests.get(url, params=values)
print r.url

Got: http://www.baidu.com/?user=aaa&id=123
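
If the endpoint returns JSON, the body can be parsed straight into a dict with r.json(). A short sketch under the assumption that the public test service httpbin.org (not part of the original article) is reachable:

import requests

values = {'user':'aaa','id':'123'}
r = requests.get('http://httpbin.org/get', params=values)

print r.url      # the query string is built for us
print r.json()   # parse the JSON response body into a dict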

3. POST requests

values = {'user':'aaa','id':'123'}
url = 'http://www.baidu.com'
r = requests.post(url, data=values)
print r.url
#print r.text

Got: http://www.baidu.com/
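
Passing the dict via data= sends a form-encoded body (application/x-www-form-urlencoded). Recent versions of Requests also accept a json= keyword that serialises the dict and sets the Content-Type header automatically. A brief sketch against httpbin.org, which is an assumption rather than part of the original article:

import requests

values = {'user':'aaa','id':'123'}

r1 = requests.post('http://httpbin.org/post', data=values)   # form body
print r1.json()['form']

r2 = requests.post('http://httpbin.org/post', json=values)   # JSON body
print r2.json()['json']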

4. Request header handling

user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.baidu.com/'
r = requests.get(url, headers=header)
print r.content

Note that servers often verify that a request comes from a browser, so we need to disguise our request as a browser request in the headers. In general, it is best to masquerade as a browser for every request to avoid errors such as access being denied; the browser check itself is a common anti-crawler strategy.

A special note: from now on, whatever request we make, always include the headers; don't leave them out to save effort. Think of it as a traffic rule: running a red light doesn't always cause an accident, but it is never safe, so we simply stop at red and go at green. The same applies to crawler requests: always add the headers so the request doesn't fail.

import urllib2

user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.qq.com/'
request = urllib2.Request(url, headers=header)
response = urllib2.urlopen(request)
print response.read().decode('gbk')

Open www.qq.com in your browser and press F12 to view the User-Agent:

User-Agent: some servers or proxies use this value to determine whether the request was sent by a browser.
Content-Type: when a REST interface is used, the server checks this value to decide how to parse the content of the HTTP body.
application/xml: used for XML RPC calls such as RESTful/SOAP
application/json: used for JSON RPC calls
application/x-www-form-urlencoded: used when the browser submits a web form
When consuming a RESTful or SOAP service provided by the server, an incorrect Content-Type setting will cause the server to refuse the request.
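
To make the Content-Type point concrete, the header can be set explicitly on a request. The sketch below posts an XML body to httpbin.org, which simply echoes back what it received; the endpoint and payload are illustrative assumptions, not code from the original article:

import requests

url = 'http://httpbin.org/post'            # stand-in for a REST endpoint
xml_body = '<user><id>123</id></user>'     # made-up XML payload

headers = {'Content-Type': 'application/xml'}
r = requests.post(url, data=xml_body, headers=headers)

print r.json()['headers']['Content-Type']  # confirm what was actually sent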

5. Process the response code and headers

url = 'http://www.baidu.com'
r = requests.get(url)

if r.status_code == requests.codes.ok:
    print r.status_code
    print r.headers
    print r.headers.get('content-type')
else:
    r.raise_for_status()

Got:
200
{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Server': 'bfe/1.0.8.18', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:57 GMT', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Date': 'Wed, 17 Jan 2018 07:21:21 GMT', 'Content-Type': 'text/html'}
text/html
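
raise_for_status() raises requests.exceptions.HTTPError for 4xx/5xx responses, so wrapping it in try/except is a common alternative to branching on the status code by hand. A minimal sketch, again using httpbin.org (an assumption) because it can return any status on demand:

import requests

r = requests.get('http://httpbin.org/status/404')   # deliberately request a 404

try:
    r.raise_for_status()
except requests.exceptions.HTTPError as e:
    print 'request failed:', e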

6. Cookies

url = 'https://www.zhihu.com/'
r = requests.get(url)
print r.cookies
print r.cookies.keys()

Got:
<RequestsCookieJar[<Cookie aliyungf_tc=AQAAACYMglZy2QsAEnaG2yYR0vrtlxfz for www.zhihu.com/>]>
['aliyungf_tc']
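
Cookies can also be sent along with a request, and a requests.Session keeps cookies across requests automatically. A short sketch; the cookie names and the httpbin.org endpoints are illustrative assumptions:

import requests

# send a cookie explicitly with a single request
cookies = {'token': 'abc123'}                                 # made-up cookie
r = requests.get('http://httpbin.org/cookies', cookies=cookies)
print r.json()

# a Session stores cookies set by the server and re-sends them later
s = requests.Session()
s.get('http://httpbin.org/cookies/set?name=value')
print s.cookies.keys()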

7. Redirection and request history

Redirects are controlled with the allow_redirects field: True allows redirects to be followed, False disables them.

url = 'http://www.baidu.com'
r = requests.get(url, allow_redirects=True)
print r.url
print r.status_code
print r.history

Got:
http://www.baidu.com/
200
[]
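
r.history holds the Response objects of any intermediate redirects, so the empty list above simply means Baidu answered without redirecting. A sketch with a URL that normally redirects from HTTP to HTTPS (github.com is assumed to still behave that way):

import requests

r = requests.get('http://github.com', allow_redirects=True)
print r.url           # https://github.com/
print r.status_code   # 200
print r.history       # [<Response [301]>]

# with redirects disabled we stop at the 301 itself
r = requests.get('http://github.com', allow_redirects=False)
print r.status_code   # 301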

8. Timeout Settings

The timeout is set with the timeout parameter:

url = 'http://www.baidu.com'
r = requests.get(url, timeout=2)
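
If the server does not answer within the given number of seconds, Requests raises requests.exceptions.Timeout, which is usually worth catching. A minimal sketch; the 0.001-second value is deliberately tiny so the exception is almost guaranteed to fire:

import requests

url = 'http://www.baidu.com'
try:
    r = requests.get(url, timeout=0.001)
except requests.exceptions.Timeout:
    print 'the request timed out'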

9. Set the proxy

proxis = {
    'http': 'http://www.baidu.com',
    'http': 'http://www.qq.com',
    'http': 'http://www.sohu.com',
}

url = 'http://www.baidu.com'
r = requests.get(url, proxies=proxis)
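
Two caveats about the snippet above: a Python dict cannot keep three entries under the same 'http' key (only the last one survives), and the keys are meant to map a protocol scheme to a proxy server address rather than to ordinary websites. A corrected sketch with made-up proxy addresses:

import requests

# hypothetical proxy servers; replace with real host:port values
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

url = 'http://www.baidu.com'
r = requests.get(url, proxies=proxies)
print r.status_code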

Author: Ni Pingyu
