In this tutorial, you will learn how to get started with a Python crawler. By following the article and understanding the implementation code, you can learn how to write a simple Python crawler in 30 minutes.

This Python crawler tutorial covers the following five parts:

  1. Understanding web pages;
  2. Fetching website data with the Requests library;
  3. Parsing web pages with Beautiful Soup;
  4. Cleaning and organizing data;
  5. Crawler attack and defense.

Understand the web page

Take the home page of China Tourism Network as an example: we will grab the first piece of information on the home page (its title and link), whose data appears as plain text in the page source. On the home page of China Tourism Network, press the shortcut key [Ctrl+U] to open the source page, as shown in Figure 1.

Figure 1 China Tourism network home page source

Understanding web structure

Web pages generally consist of three parts, namely HTML (Hypertext Markup Language), CSS (Cascading Style Sheets), and JavaScript (a scripting language).

HTML

HTML defines the structure of the entire web page and is equivalent to the frame of the entire website. Tags enclosed in "<" and ">" symbols are HTML tags, and they come in pairs.

Common tags include <html>, <head>, <body>, <div>, <p>, and <a>.

    CSS

CSS controls styling. In Figure 1, line 13, <style type="text/css"> indicates that a CSS block is referenced below, in which the page's appearance is defined.

JavaScript

JavaScript provides functionality. Interactive content and various special effects are implemented in JavaScript, which describes the various functions of the site.

Using the human body as an analogy, HTML is the skeleton and defines where the mouth, eyes, ears, and so on should be. CSS is a person's appearance details, such as what the mouth looks like, whether the eyes have double or single eyelids, whether the eyes are large or small, and whether the skin is dark or light. JavaScript represents a person's skills, such as dancing, singing, or playing musical instruments.

Write a simple HTML page

You can better understand HTML by writing and modifying it yourself. First open Notepad and type the following:

    <html>
    <body>
    <p>Getting started with crawlers and data cleaning in Python 3</p>
    <ul>
        <li>The crawler</li>
        <li>Data cleaning</li>
    </ul>
    </body>
    </html>

After entering the code, save the Notepad file, then change the file name and extension to "html.html".

    The result of running this file is shown in Figure 2.

    Figure 2

This code uses only HTML; the reader can modify the text in the code and observe the changes.

About the legality of crawlers

Almost every website has a document named robots.txt, although some websites do not provide one. For a website without robots.txt, any data not protected by password encryption can be obtained through a web crawler; in other words, all of the site's page data can be crawled. If the site does have a robots.txt document, you need to determine whether it forbids visitors from obtaining certain data.

Take Taobao.com as an example: visit its robots.txt file in a browser, as shown in Figure 3.

    Figure 3 Contents of robots.txt file of Taobao.com

Taobao allows certain crawlers to access some of its paths, but forbids all other crawlers from accessing anything. The rule reads as follows:

    User-Agent: *
    Disallow: /

This rule means that, apart from the crawlers specified earlier in the file, no crawler is allowed to crawl any of the site's data.
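If you want to check these rules programmatically rather than by reading the file, Python's standard library includes urllib.robotparser. Below is a minimal sketch; it is not part of the original tutorial, and the article path used in the check is hypothetical.

    # Checking robots.txt rules with the standard library (a sketch)
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url('https://www.taobao.com/robots.txt')  # the file discussed above
    rp.read()

    # can_fetch(user_agent, url) returns True if that user agent may crawl the URL
    print(rp.can_fetch('*', 'https://www.taobao.com/some/page'))  # hypothetical path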

    Request the web site using the Requests library

Install the Requests library

First install the Requests library in PyCharm: open PyCharm, click the "File" menu, and select the "Settings for New Projects…" command, as shown in Figure 4.

    Figure 4.

Select "Project Interpreter" to confirm the currently selected interpreter, and then click the plus sign in the upper right, as shown in Figure 5.

    Figure 5

Enter "requests" in the search box, and click the "Install Package" button in the lower left corner, as shown in Figure 6.

    Figure 6.

When the installation is complete, the message "Package 'requests' installed successfully" will appear, as shown in Figure 7. If the installation fails, an error message will be displayed instead.

    Figure 7 Successful installation

The basic principles of crawlers

    The web page request process is divided into two parts:

1. Request: every web page presented to the user must go through this step, in which the client sends a request for access to the server.
2. Response: after receiving the request, the server verifies its validity and then sends the response content to the user (client). The client receives the response content and displays it; this is the familiar process of requesting a web page, as shown in Figure 8.

Figure 8 Response

    There are two ways to request a web page:

1. GET: the most common method, used to obtain or query resource information. It is used by most websites and has a fast response speed.
2. POST: compared with GET, POST can additionally upload parameters in the form of a form. Therefore, in addition to querying information, it can also be used to modify information.

Therefore, before writing the crawler, you first need to determine to whom the request is sent and how it is sent.
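As a quick illustration of the difference, here is a minimal sketch using the Requests library; the URL and parameters are placeholders, not part of the tutorial's example.

    import requests

    # GET: parameters, if any, are appended to the URL as a query string
    r1 = requests.get('http://example.com/search', params={'q': 'keyword'})

    # POST: parameters are uploaded as a form in the request body
    r2 = requests.post('http://example.com/search', data={'q': 'keyword'})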

    Use GET to capture data

Copy the title of the first news item on the home page, press [Ctrl+F] to bring up the search box on the source page, paste the title into the search box, and then press [Enter].

As shown in Figure 8, the title can be found in the source code, so the request object is www.cntour.cn and the request method is GET (all data requests in the source code are GET requests), as shown in Figure 9.

Figure 9

Once you have determined to whom the request is sent and how, enter the following code in PyCharm:

    import requests
    url = 'http://www.cntour.cn/'
    strhtml = requests.get(url)
    print(strhtml.text)

    The running results are shown in Figure 10:

Figure 10 Running results

The statement used to load a library is import plus the library name. In the above process, the statement that loads the Requests library is import requests.

To get data with the GET method, call the get method of the Requests library by typing a dot after requests, as shown below:

    requests.get

Store the obtained data in the strhtml variable, as follows:

    strhtml = requests.get(url)

At this point, strhtml is a Response object that represents the result of the request, but all you need is the source code of the page. The following statement gives the page source code:

    strhtml.text
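A few other attributes of the Response object are also useful for checking whether the request succeeded. This is a short sketch, assuming strhtml was obtained with requests.get as above; it is not part of the original tutorial.

    print(strhtml.status_code)   # 200 means the request succeeded
    print(strhtml.encoding)      # the encoding used to decode .text
    strhtml.raise_for_status()   # raises an exception for 4xx/5xx responses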

    Capture data in POST mode

First, open the Youdao Translation website (fanyi.youdao.com) to reach the Youdao Translation page.

Press the shortcut key F12 to enter developer mode, then click "Network"; the content will be empty at first, as shown in Figure 11:

    Figure 11.

    Enter “I love China” in Youdao Translation and click the “Translate” button, as shown in Figure 12:

    Figure 12

    In Developer mode, click the “Network” button and then the “XHR” button to find the translation data, as shown in Figure 13:

    Figure 13

Click "Headers" and you will see that the request method is POST, as shown in Figure 14:

    Figure 14

    Once you’ve figured out where the data is and how to request it, it’s time to write the crawler.

First, copy the URL from Headers and assign it to url, as follows:

    url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'

The way a POST request obtains data differs from a GET request: with POST, you must construct the form data for the request before you can send it.

    The request parameters in Form Data are shown in Figure 15:

    Figure 15

    Copy it and build a new dictionary:

    From_data = {'i': 'I love China', 'from': 'zh-CHS', 'to': 'en', 'smartresult': 'dict',
                 'client': 'fanyideskweb', 'salt': '15477056211258',
                 'sign': 'b3589f32c38bc9e3876a570b8a992604', 'ts': '1547705621125',
                 'bv': 'b33a2f3f9d09bde064c9275bcb33d94e', 'doctype': 'json',
                 'version': '2.1', 'keyfrom': 'fanyi.web', 'action': 'FY_BY_REALTIME',
                 'typoResult': 'false'}

Next, use the requests.post method to send the form data, as follows:

    response = requests.post(url, data=From_data)

Convert the string data into JSON format, extract the data according to the JSON data structure, and print the translation result. The code is as follows:

    import json
    content = json.loads(response.text)
    print(content['translateResult'][0][0]['tgt'])
    
    

The complete code for grabbing the Youdao Translation result with the requests.post method is as follows:

    import requests
    import json

    def get_translate_date(word=None):
        url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
        From_data = {'i': word, 'from': 'zh-CHS', 'to': 'en', 'smartresult': 'dict',
                     'client': 'fanyideskweb', 'salt': '15477056211258',
                     'sign': 'b3589f32c38bc9e3876a570b8a992604', 'ts': '1547705621125',
                     'bv': 'b33a2f3f9d09bde064c9275bcb33d94e', 'doctype': 'json',
                     'version': '2.1', 'keyfrom': 'fanyi.web', 'action': 'FY_BY_REALTIME',
                     'typoResult': 'false'}
        # Request the form data
        response = requests.post(url, data=From_data)
        # Convert the JSON string into a dictionary
        content = json.loads(response.text)
        print(content)
        # Print the translated data
        print(content['translateResult'][0][0]['tgt'])

    if __name__ == '__main__':
        get_translate_date('I love China')

    Parse web pages using Beautiful Soup

Having captured the source code of the web page with the Requests library, the next step is to find and extract the data from that source code. Beautiful Soup is a Python library whose primary function is to extract data from web pages. Beautiful Soup has now been ported into the bs4 library, which means you need to install the bs4 library before importing Beautiful Soup.

    Figure 16 shows how to install the BS4 library:

    Figure 16

After installing the bs4 library, you also need to install the lxml library. If you do not install lxml, Python's default parser is used instead. Although Beautiful Soup supports both the HTML parser in the Python standard library and several third-party parsers, the lxml library is more powerful and faster, so I recommend installing it.
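Here is a small sketch of the two parser choices just mentioned; the HTML string is a placeholder, not taken from the tutorial.

    from bs4 import BeautifulSoup

    html = '<html><body><p>hello</p></body></html>'
    soup_default = BeautifulSoup(html, 'html.parser')  # Python's built-in parser
    soup_lxml = BeautifulSoup(html, 'lxml')            # requires the lxml package
    print(soup_default.p.text, soup_lxml.p.text)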

    Once you have installed the Python third-party library, enter the following code to start your Beautiful Soup journey:

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.cntour.cn/'
    strhtml = requests.get(url)
    soup = BeautifulSoup(strhtml.text, 'lxml')
    data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
    print(data)

    The code results are shown in Figure 17.

The Beautiful Soup library makes it easy to parse web page information. It is integrated into the bs4 library and can be imported from bs4 when needed. The import statement is as follows:

    from bs4 import BeautifulSoup

First, the HTML document is converted to Unicode, and then Beautiful Soup selects the most appropriate parser to parse the document; here we explicitly specify the lxml parser. Parsing turns a complex HTML document into a tree structure in which each node is a Python object. Here we store the parsed document in the newly created variable soup, as follows:

    soup = BeautifulSoup(strhtml.text, 'lxml')
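Because every node of the parsed tree is a Python object, you can also navigate it directly. This is a small sketch, assuming the soup object created above; it is not part of the original tutorial.

    print(soup.title)          # the <title> tag of the page
    print(soup.title.string)   # just the text inside it
    first_link = soup.a        # the first <a> tag in the document
    print(first_link.name, first_link.attrs)  # tag name and attribute dictionary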

Next, use select to locate the data. Locating the data requires the browser's developer mode: hover the mouse cursor over the corresponding data on the page, right-click, and then select "Inspect" from the shortcut menu, as shown in Figure 18:

    Figure 18

The developer pane pops up on the right side of the browser, with the highlighted code on the right (see Figure 19(b)) corresponding to the highlighted data text on the left (see Figure 19(a)). Right-click the highlighted code on the right and choose "Copy" → "Copy selector" from the shortcut menu to automatically copy the path.

Figure 19 Copying the path

    Paste the path into the document as follows:

    #main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li:nth-child(1) > a

Since this is the path to the first item selected, and we need all the headlines, we delete the part after "li" (the colon and everything following it) in li:nth-child(1), as follows:

    #main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a

Use soup.select to reference this path, as follows:

    data = soup.select('#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a')

    Clean and organize data

    At this point, you have the HTML code for the target, but you have not yet extracted the data. Enter the following code in PyCharm:

    for item in data:
        result = {
            'title': item.get_text(),
            'link': item.get('href')
        }
        print(result)
    
    

    The code running results are shown in Figure 20:

    Figure 20

The first data to extract are the title and the link of each item. The title is inside the <a> tag, and the tag's text is extracted with the get_text() method. The link is in the href attribute of the <a> tag, which is extracted with the get() method; the attribute to extract is specified in parentheses, i.e. get('href').

As you can see from Figure 20, there is a numeric ID in the link of each article. Let's extract this ID using a regular expression. The following re symbols are used:

    \d  matches a digit
    +   matches the previous character one or more times
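A quick illustration of these two symbols; the link below is a hypothetical example, not taken from the site.

    import re
    print(re.findall(r'\d+', 'http://www.cntour.cn/news/detail/12345.html'))
    # ['12345'] -- \d matches a digit, + repeats it one or more times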

To use regular expressions in Python, import the re library; it does not need to be installed and can be imported directly. Enter the following code in PyCharm:

    import re
    for item in data:
        result = {
            'title': item.get_text(),
            'link': item.get('href'),
            'ID': re.findall(r'\d+', item.get('href'))
        }
        print(result)
    
    

    The running results are shown in Figure 21:

    Figure 21

Here we use the findall method of the re library; the first argument is the regular expression and the second is the text to search.

Crawler attack and defense

A crawler simulates human browsing behavior and crawls data in batches. As the amount of data captured grows, it puts great pressure on the server being accessed and may even cause it to crash. In other words, servers do not like having their data grabbed in bulk, so websites often adopt anti-crawling strategies against such crawlers.

The first way a server recognizes a crawler is by checking the User-Agent of the connection to determine whether the access comes from a browser or from code. If it is code access, the server will block the offending IP address once the traffic increases.

So what should we do about this rudimentary anti-crawling mechanism?

Again, take the crawler created earlier as an example. In developer mode we can find not only the URL and the Form Data, but also the Request Headers that a browser sends, which we can use to disguise our crawler. The server identifies the browser by checking the User-Agent keyword under Request Headers, as shown in Figure 22.

    Figure 22

Therefore, we only need to construct the request header parameters. Create a request header as follows:

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
    response = requests.get(url, headers=headers)

At this point, many readers will think that modifying the User-Agent is too simple. It is indeed very simple, but a normal person might view one picture per second, while a crawler can grab many pictures per second, even hundreds, which inevitably puts pressure on the server. In other words, bulk downloading pictures from a single IP does not conform to normal human behavior, so that IP is bound to be blocked.

The principle is also very simple: the server counts the access frequency of each IP. If the frequency exceeds a threshold, it returns a verification code. A real user will fill it in and continue browsing; if it is code access, the IP will be blocked.

There are two solutions to this problem. The first, commonly used, is to add a delay and fetch only once every 3 seconds. The code is as follows:

    import time
    time.sleep(3)
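A common refinement, not part of the original tutorial, is to randomize the delay so that the access pattern looks less mechanical:

    import random
    import time

    time.sleep(random.uniform(1, 3))  # wait between 1 and 3 seconds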

However, the purpose of our crawler is to capture data in batches efficiently, and fetching once every 3 seconds is too inefficient. In fact, there is a more important solution, which is to address the problem at its root.

No matter how the site is accessed, the server's goal is to figure out which accesses are code access and then block those IPs. The solution: proxies are commonly used in data collection to avoid having the IP blocked. Of course, Requests also has its own proxies parameter.

First, build your own pool of proxy IPs, assign it to proxies as a dictionary, and then pass it to Requests, as follows:

    proxies = {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:1080",
    }
    response = requests.get(url, proxies=proxies)
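If you maintain several proxies, one simple strategy is to pick one at random for each request. This is a sketch, not part of the original tutorial; the addresses reuse the placeholder values above.

    import random
    import requests

    url = 'http://www.cntour.cn/'
    proxy_pool = [
        {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
        # ... more proxy entries would go here
    ]

    response = requests.get(url, proxies=random.choice(proxy_pool))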

    Further reading

This article has given only a brief introduction to the Python crawler and its implementation process. It can give beginners a basic understanding of Python crawlers, but it cannot make you fully master them.