
Preparing to write a crawler

  • Resources
    • "Python Network Data Acquisition" (Turing)
    • "Mastering the Python Crawler Framework Scrapy" (Posts and Telecommunications Press)
    • "Python3 Web Crawler"
    • Scrapy tutorial
  • Prerequisite knowledge
    • URL
    • The HTTP protocol
    • Web front end: HTML, CSS, JS
    • Ajax
    • re, XPath
    • XML

Introduction to crawlers

  • Definition: a web crawler (also known as a web spider or web robot, and more commonly as a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm.

  • Two main features
    • Downloads data or content according to the author's requirements
    • Moves across the network automatically
  • Three steps
    • Download the web page
    • Extract the relevant information
    • Follow links to another page according to certain rules, then repeat the previous two steps
  • Crawler classification
    • General-purpose crawler
    • Focused (special-purpose) crawler
  • Python networking packages
    • Python 2.x: urllib, urllib2, urllib3, httplib, httplib2, requests
    • Python 3.x: urllib, urllib3, httplib2, requests
    • Commonly used in Python 2: urllib together with urllib2, or requests
    • Commonly used in Python 3: urllib, or requests
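
The three steps listed above can be sketched as a small loop. This is only a minimal illustration, not a production crawler: the FAKE_WEB dictionary, the example.com URLs, and the helper names download, extract_links, and crawl are all hypothetical, and the "download" step reads from memory so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs offline.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def download(url):
    # Step 1: download the page (here: look it up in FAKE_WEB).
    return FAKE_WEB.get(url, "")


def extract_links(base_url, html):
    # Step 2: extract the information we care about (here: the links).
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def crawl(start_url, max_pages=10):
    # Step 3: follow the links to other pages and repeat steps 1 and 2.
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = download(url)
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl("http://example.com/")
print(pages)
```

In a real crawler, download would call urllib.request.urlopen, and the "certain rules" would decide which links are allowed into the queue.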

urllib

  • Modules it contains
    • urllib.request: opens and reads URLs
    • urllib.error: contains the exceptions raised by urllib.request, which can be caught with try/except
    • urllib.parse: contains functions for parsing URLs
    • urllib.robotparser: parses robots.txt files
    • Case v01
    # Case v01: use urllib.request to request a web page and print its content.
    from urllib import request

    if __name__ == '__main__':
        url = "https://www.zhaopin.com/taiyuan/"
        # Open the URL; urlopen returns a response object for the page
        rsp = request.urlopen(url)

        # Read the returned result; read() returns bytes
        html = rsp.read()
        print(type(html))

        # To convert bytes to a string, decode them (UTF-8 by default)
        print(html.decode())
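
Case v01 only exercises urllib.request. The other three modules listed above could be tried out roughly as follows. This is a sketch under stated assumptions: the ?page=2 query string, the robots.txt rules, the example.com URLs, the missing file path, and the fetch helper are all hypothetical, and everything is chosen so that the code runs without network access.

```python
from urllib import error, parse, request, robotparser

# urllib.parse: split a URL into its components.
parts = parse.urlparse("https://www.zhaopin.com/taiyuan/?page=2")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# urllib.robotparser: check whether a URL may be crawled.
# The rules are supplied inline here instead of being downloaded.
robots = robotparser.RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed = robots.can_fetch("*", "http://example.com/index.html")
blocked = robots.can_fetch("*", "http://example.com/private/data.html")
print(allowed, blocked)


# urllib.error: catch the exceptions raised by urllib.request.
def fetch(url):
    """Return the body as bytes, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=10) as rsp:
            return rsp.read()
    except error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", e.code, e.reason)
    except error.URLError as e:
        print("URL error:", e.reason)
    return None


# A file URL that does not exist triggers URLError without any network.
missing = fetch("file:///hypothetical/path/that/does/not/exist.html")
print(missing)
```

For a real site, RobotFileParser.set_url followed by read downloads the rules from the site's own robots.txt.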
  • Solving page encoding problems
    • chardet can automatically detect the encoding of a page, but its guess may be wrong
    • Install it with: conda install chardet
    • Case v02
    # Case v02: download a page with urllib.request and detect its encoding automatically.
    from urllib import request
    import chardet

    if __name__ == '__main__':
        url = "http://stock.eastmoney.com/news/1407,20170807763593890.html"

        rsp = request.urlopen(url)

        html = rsp.read()

        # Detect the encoding automatically with chardet
        cs = chardet.detect(html)
        print(type(cs))
        print(cs)

        # chardet may report no encoding (None), so fall back to UTF-8
        html = html.decode(cs.get("encoding") or "utf-8")
        print(html)
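
Because a detected encoding may be wrong or missing, it can help to try several candidates in order. The sketch below is one possible fallback chain, not part of chardet: the helper names charset_from_header and decode_body are hypothetical, and it uses only the standard library, so a chardet guess is passed in as an optional argument.

```python
from email.message import Message


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(raw, content_type=None, guessed=None):
    """Decode bytes using the header charset, then a guess, then UTF-8."""
    candidates = [charset_from_header(content_type), guessed, "utf-8"]
    for encoding in candidates:
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", errors="replace")


gbk_bytes = "你好".encode("gbk")
from_header = decode_body(gbk_bytes, "text/html; charset=gbk")
from_guess = decode_body(gbk_bytes, guessed="gbk")
no_hint = decode_body(gbk_bytes)
print(from_header, from_guess, repr(no_hint))
```

With urlopen, the header value would come from rsp.headers.get("Content-Type"), and guessed from chardet.detect(html).get("encoding").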
  • The object returned by urlopen
    • Case v03
    # Case v03: inspect the object returned by urlopen.
    from urllib import request

    if __name__ == '__main__':
        url = "http://stock.eastmoney.com/news/1407,20170807763593890.html"

        rsp = request.urlopen(url)

        print(type(rsp))
        print(rsp)

        print("URL: {0}".format(rsp.geturl()))
        print("Info: {0}".format(rsp.info()))
        print("Code: {0}".format(rsp.getcode()))

        html = rsp.read()

        # Decode the bytes into a string (UTF-8 by default)
        html = html.decode()
    • geturl: returns the URL of the resource actually retrieved (after any redirects)
    • info: returns the meta information of the response, such as the headers
    • getcode: returns the HTTP status code
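
To try these three methods without depending on an external site, one option is to start a throwaway HTTP server on the local machine. The following is a sketch under that assumption: HelloHandler and inspect_response are hypothetical names, and the empty ProxyHandler is only there so that a system proxy does not intercept the local request.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical local server that returns a tiny HTML page."""

    def do_GET(self):
        body = b"<html>hello</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def inspect_response():
    # Port 0 lets the operating system pick a free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:{0}/".format(server.server_port)
        opener = request.build_opener(request.ProxyHandler({}))
        with opener.open(url, timeout=5) as rsp:
            return {
                "requested": url,
                "url": rsp.geturl(),
                "code": rsp.getcode(),
                "content_type": rsp.info().get_content_type(),
                "charset": rsp.info().get_content_charset(),
                "body": rsp.read(),
            }
    finally:
        server.shutdown()
        server.server_close()


info = inspect_response()
print(info)
```

Since Python 3.9 the same values are also available as the attributes rsp.url, rsp.headers, and rsp.status, which the standard library documentation recommends over geturl, info, and getcode.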
  • Using request.data
    • Two ways to access the network
      • GET
        • Passes information to the server as parameters in the URL
        • The parameters are a dict, encoded with parse
        • Case v04
         # Case v04: learn how to encode URLs with the parse module.
         from urllib import request, parse

         if __name__ == '__main__':
             url = "http://www.baidu.com/s?"
             wd = input("Input your keyword: ")

             # The parameters must be stored in a dictionary
             qs = {
                 "wd": wd
             }

             # Convert the dictionary to a URL-encoded query string
             qs = parse.urlencode(qs)
             print(qs)

             fullurl = url + qs
             print(fullurl)

             # A human-readable URL with raw, unencoded parameters
             # (for example a non-ASCII keyword) cannot be opened directly:
             # fullurl = "http://www.baidu.com/s?wd=大熊猫"

             rsp = request.urlopen(fullurl)

             html = rsp.read()

             # Decode the bytes into a string (UTF-8 by default)
             html = html.decode()

             print(html)
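
To see what the encoding step in case v04 produces without typing anything or touching the network, the query string can be built and parsed back on its own. This is a small sketch; build_search_url is a hypothetical helper and the keywords are just sample input.

```python
from urllib import parse


def build_search_url(base, params):
    """Append URL-encoded query parameters to a base URL."""
    return base + "?" + parse.urlencode(params)


# Non-ASCII text is encoded as UTF-8 bytes, then percent-escaped.
url = build_search_url("http://www.baidu.com/s", {"wd": "大熊猫"})
print(url)
# http://www.baidu.com/s?wd=%E5%A4%A7%E7%86%8A%E7%8C%AB

# Spaces become "+" in a query string.
spaced = parse.urlencode({"wd": "web crawler"})
print(spaced)
# wd=web+crawler

# parse_qs reverses the encoding; every value comes back as a list.
decoded = parse.parse_qs(parse.urlparse(url).query)
print(decoded)
# {'wd': ['大熊猫']}
```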
      • POST
        • Generally used to submit parameters to the server
        • POST sends the information in the request body instead of the URL; it does not encrypt anything by itself (encryption requires HTTPS)
        • To send information with POST, use the data argument
        • Using POST means the HTTP request headers may need to change:
          • Content-Type: application/x-www-form-urlencoded
          • Content-Length: the length of the data
          • In short, once you change the request method, make sure the other request headers match
        • urllib.parse.urlencode converts a dict into the format above
        • Case v05
         Example V05 using the PARSE module to simulate post request analysis baidu Dictionary analysis steps: 1. Open F12 2. Try to type the word girl, find that every time you type a letter, there is a request 3. Request the address is https://fanyi.baidu.com/sug 4. The FormData value is kw:girl 5. Json package '' ==>
         
         from urllib import request, parse
         Handle jSON-formatted modules
         import json
         
         The general process is as follows: 1. Use data to construct content, and then urlopen to open 2. Return a result in json format 3. The result should be the definition "" of girl.
         
         baseurl = 'https://fanyi.baidu.com/sug'
         
         The data stored to simulate forms must be in dict format
         data = {
             # girl is translated into English and should be entered by the user, in this case hard coded
             'kw': 'girl'
         }
         
         Data needs to be encoded using the parse module
         data = parse.urlencode(data).encode('utf-8')
         print(type(data))
         
         We need to construct a request header that should contain at least the length of the incoming data
         # request requires that the incoming request header be in dict format
         
         headers = {
             Because post requests are used, at least the Content-Length field should be included
             'Content-Length':len(data)
         }
         
         With headers, data, url, you can try to make a request
         
         rsp = request.urlopen(baseurl, data=data)
         
         json_data = rsp.read().decode('utf-8')
         print(type(json_data))
         print(json_data)
         
         Convert json strings into dictionaries
         json_data = json.loads(json_data)
         print(type(json_data))
         print(json_data)
         
         for item in json_data['data'] :print(item['v']."--", item['v'])
        Copy the code
        <class 'bytes'> <class 'str'> {"errno":0,"data":[{"k":"girl","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f;  \u5e74\u8f7b\u5973\u5b50; \u5973\u90ce;"},{"k":"girls","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f;  \u5e74\u8f7b\u5973\u5b50; \u5973\u90ce; girl\u7684\u590d\u6570;"},{"k":"girlfriend","v":"n. \u5973\u670b\u53cb;  \u5973\u60c5\u4eba; (\u5973\u5b50\u7684)\u5973\u4f34\uff0c\u5973\u53cb;"},{"k":"girl friend","v":" \u672a\u5a5a\u59bb;  \u5973\u6027\u670b\u53cb; "},{"k":"Girls' Generation","v":" \u5c11\u5973\u65f6\u4ee3\uff08\u97e9\u56fdSM\u5a31\u4e50\u6709\u9650\u516c\u53f8\u4e8e2007\u5e74\u63a8\u51fa\u7684\u4e5d \u540d\u5973\u5b50\u5c11\u5973\u7ec4\u5408\uff09;"}]} <class 'dict'> {'errno': 0, 'data': [{'k': 'girl', 'v': 'n. girl; girl; her daughter; young woman; the girl;'}, {' k ':' girls', 'v' : 'n. girl; girl, daughter, a young woman; girl; girl's plural;'}, {' k ':' girlfriend ', 'v' : 'n. Girlfriend; female lover; girl (woman), his girlfriend;'}, {' k ':' girl friend ', 'v' : 'fiancee; women;}, {' k' : "Girls' Generation", "v" : 'Girls' Generation' (a nine-girl group launched by SM Entertainment In 2007) Girl; The girl; Daughter; A young woman; Girl; -- n. girl; The girl; Daughter; A young woman; Girl; N. girl; The girl; Daughter; A young woman; Girl; Girl's plural; -- n. girl; The girl; Daughter; A young woman; Girl; Girl's plural; A girlfriend; Female lover; (a woman's) companion -- N. girlfriend; Female lover; (a woman's) companion His fiancee. Female friends; -- Fiancee; Female friends; Girls' Generation (a nine-girl group launched by SM Entertainment In 2007); -- Girls' Generation (a nine-girl group launched by SM Entertainment In 2007);Copy the code
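
Which headers a POST sent through urlopen actually carries can be checked against a local server that echoes back what it received. The sketch below assumes such a throwaway server: EchoHandler and post_form are hypothetical names, the /sug path only mirrors the case above, and nothing is sent to Baidu.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import parse, request


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical local server that reports what the client sent."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        report = {
            "method": self.command,
            "content_type": self.headers.get("Content-Type"),
            "content_length": length,
            "form": parse.parse_qs(body),
        }
        payload = json.dumps(report).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def post_form(fields):
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:{0}/sug".format(server.server_port)
        # urlencode gives a str; the body of a POST must be bytes.
        data = parse.urlencode(fields).encode("utf-8")
        # The empty ProxyHandler keeps a system proxy out of the way.
        opener = request.build_opener(request.ProxyHandler({}))
        with opener.open(url, data=data, timeout=5) as rsp:
            return json.loads(rsp.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()


report = post_form({"kw": "girl"})
print(report)
```

The report shows that urllib fills in Content-Type: application/x-www-form-urlencoded and Content-Length on its own when data is given and those headers are not set explicitly.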
        • To set more request information, such as headers, calling urlopen with a URL alone is not enough
        • Use the request.Request class instead
        • Case v06
         V05: Post Request v05: Post Request V05: Post Request Open F12 2. Try to type the word girl, find that every time you type a letter, there is a request 3. Request the address is https://fanyi.baidu.com/sug 4. The FormData value is kw:girl 5. Json package '' ==>
         
         from urllib import request, parse
         Handle jSON-formatted modules
         import json
         
         The general process is as follows: 1. Use data to construct content, and then urlopen to open 2. Return a result in json format 3. The result should be the definition "" of girl.
         
         baseurl = 'https://fanyi.baidu.com/sug'
         
         The data stored to simulate forms must be in dict format
         data = {
             # girl is translated into English and should be entered by the user, in this case hard coded
             'kw': 'girl'
         }
         
         Data needs to be encoded using the parse module
         data = parse.urlencode(data).encode('utf-8')
         
         We need to construct a request header that should contain at least the length of the incoming data
         # request requires that the incoming request header be in dict format
         
         headers = {
             Because post requests are used, at least the Content-Length field should be included
             'Content-Length':len(data)
         }
         
         Construct an instance of Request
         
         req = request.Request(url=baseurl, data=data, headers=headers)
         
         Since a Request instance has been constructed, all Request information can be encapsulated in the Request instance
         rsp = request.urlopen(req)
         
         json_data = rsp.read().decode('utf-8')
         print(type(json_data))
         print(json_data)
         
         Convert json strings into dictionaries
         json_data = json.loads(json_data)
         print(type(json_data))
         print(json_data)
         
         for item in json_data['data'] :print(item['v']."--", item['v'])
        Copy the code
        <class 'str'> {"errno":0,"data":[{"k":"girl","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f; \u5e74\u8f7b\u5973\u5b50;  \u5973\u90ce;"},{"k":"girls","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f; \u5e74\u8f7b\u5973\u5b50; \u5973\u90ce;  girl\u7684\u590d\u6570;"},{"k":"girlfriend","v":"n. \u5973\u670b\u53cb; \u5973\u60c5\u4eba;  (\u5973\u5b50\u7684)\u5973\u4f34\uff0c\u5973\u53cb;"},{"k":"girl friend","v":" \u672a\u5a5a\u59bb;  \u5973\u6027\u670b\u53cb; "},{"k":"Girls' Generation","v":" \u5c11\u5973\u65f6\u4ee3\uff08\u97e9\u56fdSM\u5a31\u4e50\u6709\u9650\u516c\u53f8\u4e8e2007\u5e74\u63a8\u51fa\u7684\u4e5d \u540d\u5973\u5b50\u5c11\u5973\u7ec4\u5408\uff09;"}]} <class 'dict'> {'errno': 0, 'data': [{'k': 'girl', 'v': 'n. girl; girl; her daughter; young woman; the girl;'}, {' k ':' girls', 'v' : 'n. girl; girl, daughter, a young woman; girl; girl's plural;'}, {' k ':' girlfriend ', 'v' : 'n. Girlfriend; female lover; girl (woman), his girlfriend;'}, {' k ':' girl friend ', 'v' : 'fiancee; women;}, {' k' : "Girls' Generation", "v" : 'Girls' Generation' (a nine-girl group launched by SM Entertainment In 2007) Girl; The girl; Daughter; A young woman; Girl; -- n. girl; The girl; Daughter; A young woman; Girl; N. girl; The girl; Daughter; A young woman; Girl; Girl's plural; -- n. girl; The girl; Daughter; A young woman; Girl; Girl's plural; A girlfriend; Female lover; (a woman's) companion -- N. girlfriend; Female lover; (a woman's) companion His fiancee. Female friends; -- Fiancee; Female friends; Girls' Generation (a nine-girl group launched by SM Entertainment In 2007); -- Girls' Generation (a nine-girl group launched by SM Entertainment In 2007);Copy the code
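
A Request instance can also be inspected before it is sent, which is a convenient way to check what it encapsulates. The sketch below sends nothing over the network; the User-Agent value is a made-up example, not something the target site is known to require.

```python
from urllib import parse, request

data = parse.urlencode({"kw": "girl"}).encode("utf-8")
headers = {"User-Agent": "Mozilla/5.0 (hypothetical example agent)"}

# Without data the request is a GET; with data it becomes a POST.
get_req = request.Request("https://fanyi.baidu.com/sug", headers=headers)
post_req = request.Request("https://fanyi.baidu.com/sug", data=data, headers=headers)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'kw=girl'

# Header names are stored capitalized ("User-agent"), so look them up
# in that form when using get_header or has_header.
print(post_req.header_items())
print(post_req.get_header("User-agent"))
```

Passing method="PUT" (or another verb) to Request overrides the default choice between GET and POST.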
