Thank you for reading this article, which contains my notes from my learning process; I hope it will be of some help to readers. If you find any mistakes or have better suggestions, please let me know promptly so that the content stays accurate and readable.

1 Overview

A web crawler (also known as a web spider or web robot, and more often called a web scutter in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules.

Put simply, a crawler is a program that simulates a browser to send requests, extracts the useful information from the resources the server responds with (HTML, JSON, and so on), and then saves it.
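For instance, a single request that pretends to be a browser can be made with the requests library mentioned below; this is only a minimal sketch, and the URL and User-Agent string are placeholders chosen for illustration:

```python
import requests

# A crawler "pretends" to be a browser mainly by sending a browser-like
# User-Agent header; the URL and header value below are only placeholders.
url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)    # 200 means the server answered normally
print(response.text[:200])     # beginning of the returned HTML source
```

The text in response.text is the same HTML source a browser would receive before rendering the page.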


2 How a crawler works

Simply put, a crawler is an automated program that fetches web pages, then extracts and saves information from them.

As the figure above shows, a crawler program actually consists of three steps:

  1. Fetching the web page

    Fetching a web page means simulating a browser visit in order to obtain the page's source code. Python provides many libraries for this, such as urllib and requests.

  2. Extracting information

    Extracting information means pulling the useful data out of the page's source code. Python provides many libraries for this, such as re (regular expressions), XPath (for example via lxml), and Beautiful Soup (bs4).

  3. Saving the data

    Saving the data means persisting the useful information we extracted, for example writing it to a TXT or JPG file, or storing it in a database such as MySQL or Redis.

    By implementing these three steps, as sketched below, we have a complete, if simple, crawler program that can automatically collect useful information for us.
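As a rough end-to-end sketch, the three steps above can be combined as follows. It assumes the requests and Beautiful Soup (bs4) libraries mentioned earlier; the target URL and the output file name result.txt are made up for illustration:

```python
import requests
from bs4 import BeautifulSoup

# Step 1: fetch the web page (simulate a browser request).
url = "https://example.com"          # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}
html = requests.get(url, headers=headers, timeout=10).text

# Step 2: extract information from the HTML source.
soup = BeautifulSoup(html, "html.parser")
title = soup.title.get_text(strip=True) if soup.title else ""
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]

# Step 3: save (persist) the extracted data, here as a plain TXT file.
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(f"title: {title}\n")
    for link in links:
        f.write(link + "\n")

print(f"saved {len(links)} links from {url} to result.txt")
```

Saving to a database such as MySQL or Redis would simply replace the file-writing step with the corresponding client library calls.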