“This is the 17th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

Preparing for the crawler

Resources
- Python Network Data Acquisition (Turing)
- Mastering the Python Crawler Framework Scrapy (Posts & Telecom Press)
- Python3 Web crawler
- Scrapy tutorial
Prerequisite knowledge
- URLs
- The HTTP protocol
- Web front end: HTML, CSS, JS
- Ajax
- Regular expressions (re) and XPath
- XML

Introduction to crawlers

Definition: a web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm.

Two main features
- It downloads data or content as requested by its author
- It moves across the network automatically
Three steps
- Download the web page
- Extract the correct information
- Follow certain rules to jump to another page automatically and repeat the two steps above
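The three steps above can be sketched as a small loop. This is only a minimal illustration: FAKE_WEB, download, extract, and crawl are hypothetical names, and the "web" is an in-memory dict so the sketch runs without a network connection.

```python
import re
from collections import deque

# Hypothetical in-memory "web" (URL -> HTML) so the sketch runs offline
FAKE_WEB = {
    "http://example.com/": '<title>Home</title><a href="http://example.com/a">A</a>',
    "http://example.com/a": '<title>Page A</title><a href="http://example.com/b">B</a>'
                            '<a href="http://example.com/">home</a>',
    "http://example.com/b": "<title>Page B</title>",
}


def download(url):
    """Step 1: download the page (here: look it up in FAKE_WEB)."""
    return FAKE_WEB.get(url, "")


def extract(html):
    """Step 2: extract the wanted information (the title) and the outgoing links."""
    match = re.search(r"<title>(.*?)</title>", html)
    title = match.group(1) if match else None
    links = re.findall(r'href="([^"]+)"', html)
    return title, links


def crawl(start_url, max_pages=10):
    """Step 3: follow links and repeat steps 1 and 2 for every new page."""
    seen = {start_url}
    queue = deque([start_url])
    results = []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        title, links = extract(download(url))
        results.append((url, title))
        for link in links:
            if link not in seen:  # never visit the same page twice
                seen.add(link)
                queue.append(link)
    return results


if __name__ == "__main__":
    for url, title in crawl("http://example.com/"):
        print(url, "->", title)
```

A real crawler would replace download with a network request and would usually use a proper HTML parser instead of regular expressions.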
Crawler classification
- General-purpose crawler
- Focused (dedicated) crawler
Introduction to Python network packages
- Python 2.x: urllib, urllib2, urllib3, httplib, httplib2, requests
- Python 3.x: urllib, urllib3, httplib2, requests
- Commonly used in Python 2: urllib and urllib2, or requests
- Commonly used in Python 3: urllib, or requests

urllib

It contains these modules

urllib.request: opens and reads URLs
urllib.error: contains the common exceptions raised by urllib.request, which can be caught with try
urllib.parse: contains methods for parsing URLs
urllib.robotparser: parses robots.txt files
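As a small illustration of the last module, the sketch below feeds a robots.txt text to urllib.robotparser and asks whether a URL may be fetched. ROBOTS_TXT, build_parser, and the crawler name "my-crawler" are made up for this example; the rules are parsed from a string, so nothing is downloaded.

```python
from urllib import robotparser

# Hypothetical robots.txt content, parsed from memory so no network is needed
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Create a RobotFileParser from robots.txt text instead of a URL."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


if __name__ == "__main__":
    rp = build_parser(ROBOTS_TXT)
    # Allowed: no rule matches this path
    print(rp.can_fetch("my-crawler", "http://example.com/index.html"))
    # Not allowed: the path starts with /private/
    print(rp.can_fetch("my-crawler", "http://example.com/private/data.html"))
```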
Case v01

# Case v01: use urllib.request to request a web page and print its content
from urllib import request

if __name__ == '__main__':
    url = "https://www.zhaopin.com/taiyuan/"
    # Open the URL and get back the response for the page
    rsp = request.urlopen(url)

    # Read the returned result
    # What read() returns is of type bytes
    html = rsp.read()
    print(type(html))

    # To convert the bytes into a string, decode them
    print(html.decode())

Solving the page encoding problem

chardet can detect the encoding of a page automatically, but its guess may be wrong
Install it with: conda install chardet
Case v02

# Case v02: use request to download a page and detect its encoding automatically
import urllib.request
import chardet

if __name__ == '__main__':
    url = "http://stock.eastmoney.com/news/1407,20170807763593890.html"

    rsp = urllib.request.urlopen(url)

    html = rsp.read()

    # Detect the encoding automatically with chardet
    cs = chardet.detect(html)
    print(type(cs))
    print(cs)

    # Fall back to utf-8 so that a missing or empty guess does not raise an error
    html = html.decode(cs.get("encoding") or "utf-8")
    print(html)
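Because a detected encoding may be wrong, it can help to try a few candidates before giving up. The sketch below uses only the standard library. decode_page is a hypothetical helper, and the fallback order (the guess, then utf-8, then gbk) is an assumption that suits many Chinese pages, not a general rule.

```python
def decode_page(raw, guessed_encoding=None):
    """Decode bytes with a guessed encoding, then fall back to utf-8 and gbk.

    Returns a (text, encoding_used) tuple.
    """
    # Assumed fallback order; adjust it for the sites you crawl
    candidates = [guessed_encoding, "utf-8", "gbk"]
    for name in candidates:
        if not name:
            continue  # skip a missing guess (None or empty string)
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate
            continue
    # Last resort: keep the text and mark the bytes that cannot be decoded
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacement)"


if __name__ == "__main__":
    gbk_bytes = "股票".encode("gbk")
    # The guess "ascii" is wrong on purpose, so the helper has to fall back
    print(decode_page(gbk_bytes, "ascii"))
```

The guess passed in could be the "encoding" value from chardet.detect, as in Case v02.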

The object returned by urlopen

Case v03

# Case v03
import urllib.request
import chardet

if __name__ == '__main__':
    url = "http://stock.eastmoney.com/news/1407,20170807763593890.html"

    rsp = urllib.request.urlopen(url)

    print(type(rsp))
    print(rsp)

    print("URL: {0}".format(rsp.geturl()))
    print("Info: {0}".format(rsp.info()))
    print("Code: {0}".format(rsp.getcode()))

    html = rsp.read()

    # Decode the bytes into a string (utf-8 by default)
    html = html.decode()

geturl: returns the URL of the response
info: returns the meta information (headers) of the response
getcode: returns the HTTP status code
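To try these three methods without depending on an external site, the sketch below starts a throwaway HTTP server on 127.0.0.1 and requests a page from it. HelloHandler and fetch_from_local_server are hypothetical names, and the empty ProxyHandler is only a precaution so that a system proxy does not intercept the local request. Newer Python documentation lists rsp.url, rsp.headers, and rsp.status as the preferred spellings of the same information.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = "<html><body>hello</body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_from_local_server():
    # Port 0 asks the operating system for any free port
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:{0}/index.html".format(server.server_address[1])
        # An empty ProxyHandler bypasses any system proxy for this local call
        opener = request.build_opener(request.ProxyHandler({}))
        with opener.open(url, timeout=5) as rsp:
            return {
                "url": rsp.geturl(),
                "code": rsp.getcode(),
                "content_type": rsp.info().get("Content-Type"),
                "text": rsp.read().decode("utf-8"),
            }
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for key, value in fetch_from_local_server().items():
        print(key, ":", value)
```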

Using request.data

Two ways to access the network

get:

Parameters in the URL pass information to the server
The parameters are a dict, which is then encoded with parse
Case v04

# Case v04
from urllib import request, parse

# Learn how to encode URLs with the parse module

if __name__ == '__main__':
    url = "http://www.baidu.com/s?"
    wd = input("Input your keyword: ")

    # To use data, put it in a dict
    qs = {
        "wd": wd
    }

    # URL-encode the dict
    qs = parse.urlencode(qs)
    print(qs)

    fullurl = url + qs
    print(fullurl)

    # A human-readable URL with unencoded parameters cannot be opened directly
    # fullurl = "http://www.baidu.com/s?wd= panda"

    rsp = request.urlopen(fullurl)

    html = rsp.read()

    # Decode the bytes into a string (utf-8 by default)
    html = html.decode()

    print(html)
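The encoding step in Case v04 can be checked offline. In this sketch build_search_url is a hypothetical helper. It shows what parse.urlencode does to a space and to non-ASCII text, and that parse.parse_qs turns the query back into the original value.

```python
from urllib import parse


def build_search_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    query = parse.urlencode(params)
    if base_url.endswith(("?", "&")):
        return base_url + query
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + query


if __name__ == "__main__":
    full_url = build_search_url("http://www.baidu.com/s?", {"wd": "panda 熊猫"})
    # The space becomes "+" and each Chinese character becomes %XX bytes
    print(full_url)
    # Decode the query string again to check the round trip
    print(parse.parse_qs(parse.urlsplit(full_url).query))
```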

post

Generally used to submit parameters to the server
POST does not encrypt the information by itself; it sends the data in the request body instead of in the URL
To send information with POST, use the data argument
Using POST means that the HTTP request headers may need to change:
- Content-Type: application/x-www-form-urlencoded
- Content-Length: the length of the data
- In short, once you change the request method, make sure the other request headers match
urllib.parse.urlencode converts a dict into the format above
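As a rough illustration of the two headers listed above, the sketch below encodes a form dict into the bytes that go into the request body and derives the matching header values. encode_form and the "note" field are made up for this example.

```python
from urllib import parse


def encode_form(fields):
    """Return the POST body bytes and matching headers for a form dict."""
    # urlencode gives a str; the request body must be bytes
    body = parse.urlencode(fields).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        # Content-Length counts bytes, not characters
        "Content-Length": str(len(body)),
    }
    return body, headers


if __name__ == "__main__":
    body, headers = encode_form({"kw": "girl", "note": "a b&c"})
    print(body)
    print(headers)
```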
Case v05

# Case v05
# Use the parse module to simulate a POST request and analyze Baidu dictionary
# Analysis steps:
# 1. Open the developer tools with F12
# 2. Type the word girl; every letter typed triggers a request
# 3. The request address is https://fanyi.baidu.com/sug
# 4. The FormData value is kw: girl
# 5. The response is in JSON format ==> the json package is needed

from urllib import request, parse
# Module for handling JSON
import json

# The general process is:
# 1. Build the content with data, then open it with urlopen
# 2. A result in JSON format is returned
# 3. The result should be the definition of girl

baseurl = 'https://fanyi.baidu.com/sug'

# The data that simulates the form must be a dict
data = {
    # girl is the word to translate; it should come from user input, but is hard-coded here
    'kw': 'girl'
}

# The data needs to be encoded with the parse module
data = parse.urlencode(data).encode('utf-8')
print(type(data))

# We need to construct a request header that contains at least the length of the data
# request requires the headers to be passed in as a dict

headers = {
    # Because this is a POST request, at least the Content-Length field should be included
    'Content-Length': len(data)
}

# With headers, data, and url in hand, we can try to send the request
# Note: urlopen takes no headers argument, so the headers dict is not sent in this case

rsp = request.urlopen(baseurl, data=data)

json_data = rsp.read().decode('utf-8')
print(type(json_data))
print(json_data)

# Convert the JSON string into a dict
json_data = json.loads(json_data)
print(type(json_data))
print(json_data)

# Print the definition (the v field) of every entry
for item in json_data['data']:
    print(item['v'], "--", item['v'])

Output (definitions translated into English):

<class 'bytes'>
<class 'str'>
{"errno":0,"data":[{"k":"girl","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f; \u5e74\u8f7b\u5973\u5b50; \u5973\u90ce;"},{"k":"girls","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f; \u5e74\u8f7b\u5973\u5b50; \u5973\u90ce; girl\u7684\u590d\u6570;"},{"k":"girlfriend","v":"n. \u5973\u670b\u53cb; \u5973\u60c5\u4eba; (\u5973\u5b50\u7684)\u5973\u4f34\uff0c\u5973\u53cb;"},{"k":"girl friend","v":" \u672a\u5a5a\u59bb; \u5973\u6027\u670b\u53cb; "},{"k":"Girls' Generation","v":" \u5c11\u5973\u65f6\u4ee3\uff08\u97e9\u56fdSM\u5a31\u4e50\u6709\u9650\u516c\u53f8\u4e8e2007\u5e74\u63a8\u51fa\u7684\u4e5d\u540d\u5973\u5b50\u5c11\u5973\u7ec4\u5408\uff09;"}]}
<class 'dict'>
{'errno': 0, 'data': [{'k': 'girl', 'v': 'n. girl; young lady; daughter; young woman; maiden;'}, {'k': 'girls', 'v': 'n. girl; young lady; daughter; young woman; maiden; plural of girl;'}, {'k': 'girlfriend', 'v': 'n. girlfriend; female lover; (of a woman) female companion, female friend;'}, {'k': 'girl friend', 'v': ' fiancee; female friend; '}, {'k': "Girls' Generation", 'v': " Girls' Generation (a nine-member girl group launched by SM Entertainment of South Korea in 2007);"}]}
n. girl; young lady; daughter; young woman; maiden; -- n. girl; young lady; daughter; young woman; maiden;
n. girl; young lady; daughter; young woman; maiden; plural of girl; -- n. girl; young lady; daughter; young woman; maiden; plural of girl;
n. girlfriend; female lover; (of a woman) female companion, female friend; -- n. girlfriend; female lover; (of a woman) female companion, female friend;
fiancee; female friend; -- fiancee; female friend;
Girls' Generation (a nine-member girl group launched by SM Entertainment of South Korea in 2007); -- Girls' Generation (a nine-member girl group launched by SM Entertainment of South Korea in 2007);
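The JSON handling can be tried without calling the service. SAMPLE below is a shortened copy of the response shown above, and parse_suggestions is a hypothetical helper that pairs each word (the k field) with its definition (the v field). Treating a non-zero errno as a failure is an assumption, since the article only shows errno 0.

```python
import json

# Shortened sample of the JSON text returned by the service (see the output above)
SAMPLE = (
    '{"errno":0,"data":['
    '{"k":"girl","v":"n. \\u5973\\u5b69;"},'
    '{"k":"girls","v":"n. \\u5973\\u5b69; girl\\u7684\\u590d\\u6570;"}'
    ']}'
)


def parse_suggestions(json_text):
    """Return (word, definition) pairs; an empty list if errno is not 0."""
    payload = json.loads(json_text)  # \uXXXX escapes become real characters
    if payload.get("errno") != 0:    # assumption: non-zero errno means failure
        return []
    return [(item["k"], item["v"]) for item in payload.get("data", [])]


if __name__ == "__main__":
    for word, definition in parse_suggestions(SAMPLE):
        print(word, "--", definition)
```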

To set more request information, the urlopen function alone is not enough
You need the request.Request class
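A Request object can be built and inspected before anything is sent, which makes it easy to see what it carries. In this sketch build_post_request and the User-Agent value "demo-crawler/0.1" are made up for illustration; nothing is sent over the network.

```python
from urllib import parse, request


def build_post_request(url, fields, user_agent):
    """Build a Request that carries form data and custom headers."""
    data = parse.urlencode(fields).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": user_agent,
    }
    return request.Request(url=url, data=data, headers=headers)


if __name__ == "__main__":
    req = build_post_request(
        "https://fanyi.baidu.com/sug", {"kw": "girl"}, "demo-crawler/0.1"
    )
    # The method becomes POST because data is not None
    print(req.get_method())
    print(req.full_url)
    print(req.data)
    # Request stores header names capitalized, e.g. "Content-type"
    print(req.header_items())
```

Passing the finished object to request.urlopen(req) would then send it with these headers, as Case v06 does.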
Case v06

# Case v06
# Use the parse module to simulate a POST request and analyze Baidu dictionary
# Analysis steps:
# 1. Open the developer tools with F12
# 2. Type the word girl; every letter typed triggers a request
# 3. The request address is https://fanyi.baidu.com/sug
# 4. The FormData value is kw: girl
# 5. The response is in JSON format ==> the json package is needed

from urllib import request, parse
# Module for handling JSON
import json

# The general process is:
# 1. Build the content with data, then open it with urlopen
# 2. A result in JSON format is returned
# 3. The result should be the definition of girl

baseurl = 'https://fanyi.baidu.com/sug'

# The data that simulates the form must be a dict
data = {
    # girl is the word to translate; it should come from user input, but is hard-coded here
    'kw': 'girl'
}

# The data needs to be encoded with the parse module
data = parse.urlencode(data).encode('utf-8')

# We need to construct a request header that contains at least the length of the data
# request requires the headers to be passed in as a dict

headers = {
    # Because this is a POST request, at least the Content-Length field should be included
    'Content-Length': len(data)
}

# Construct a Request instance

req = request.Request(url=baseurl, data=data, headers=headers)

# Now that a Request instance exists, all the request information is wrapped inside it
rsp = request.urlopen(req)

json_data = rsp.read().decode('utf-8')
print(type(json_data))
print(json_data)

# Convert the JSON string into a dict
json_data = json.loads(json_data)
print(type(json_data))
print(json_data)

# Print the definition (the v field) of every entry
for item in json_data['data']:
    print(item['v'], "--", item['v'])

Output (definitions translated into English):

<class 'str'>
{"errno":0,"data":[{"k":"girl","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f; \u5e74\u8f7b\u5973\u5b50; \u5973\u90ce;"},{"k":"girls","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f; \u5e74\u8f7b\u5973\u5b50; \u5973\u90ce; girl\u7684\u590d\u6570;"},{"k":"girlfriend","v":"n. \u5973\u670b\u53cb; \u5973\u60c5\u4eba; (\u5973\u5b50\u7684)\u5973\u4f34\uff0c\u5973\u53cb;"},{"k":"girl friend","v":" \u672a\u5a5a\u59bb; \u5973\u6027\u670b\u53cb; "},{"k":"Girls' Generation","v":" \u5c11\u5973\u65f6\u4ee3\uff08\u97e9\u56fdSM\u5a31\u4e50\u6709\u9650\u516c\u53f8\u4e8e2007\u5e74\u63a8\u51fa\u7684\u4e5d\u540d\u5973\u5b50\u5c11\u5973\u7ec4\u5408\uff09;"}]}
<class 'dict'>
{'errno': 0, 'data': [{'k': 'girl', 'v': 'n. girl; young lady; daughter; young woman; maiden;'}, {'k': 'girls', 'v': 'n. girl; young lady; daughter; young woman; maiden; plural of girl;'}, {'k': 'girlfriend', 'v': 'n. girlfriend; female lover; (of a woman) female companion, female friend;'}, {'k': 'girl friend', 'v': ' fiancee; female friend; '}, {'k': "Girls' Generation", 'v': " Girls' Generation (a nine-member girl group launched by SM Entertainment of South Korea in 2007);"}]}
n. girl; young lady; daughter; young woman; maiden; -- n. girl; young lady; daughter; young woman; maiden;
n. girl; young lady; daughter; young woman; maiden; plural of girl; -- n. girl; young lady; daughter; young woman; maiden; plural of girl;
n. girlfriend; female lover; (of a woman) female companion, female friend; -- n. girlfriend; female lover; (of a woman) female companion, female friend;
fiancee; female friend; -- fiancee; female friend;
Girls' Generation (a nine-member girl group launched by SM Entertainment of South Korea in 2007); -- Girls' Generation (a nine-member girl group launched by SM Entertainment of South Korea in 2007);

Finally, you are welcome to follow my personal WeChat official account “Little Ape Ruochen” for more IT technology, practical knowledge, and hot news.
