
Introduction to crawlers

Crawler: a program that simulates a client sending network requests, receives the corresponding responses, and automatically gathers Internet information according to certain rules. In principle, a crawler can do anything a client (browser) can do. In practice, this means grabbing web data: images, short videos, e-books, text comments, product details and so on can all be pulled down by a crawler.
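To make "simulates a client sending a network request" concrete, here is a minimal sketch of the request/response cycle using only the standard library. A local HTTP server stands in for a real website so the example is self-contained (no external network access); in real projects you would point a library like Requests at an actual URL instead.

```python
import http.server
import threading
import urllib.request

class DemoHandler(http.server.BaseHTTPRequestHandler):
    """A stand-in 'website' that serves one small HTML page."""
    def do_GET(self):
        body = b"<html><body><h1>Hello, crawler</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Start the demo server on an ephemeral port in a background thread.
server = http.server.HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "crawler" side: send a GET request and read the response body,
# which is exactly the HTML a browser would receive for the same URL.
url = f"http://127.0.0.1:{server.server_port}/"
with urllib.request.urlopen(url) as resp:
    html = resp.read().decode("utf-8")

server.shutdown()
print(html)
```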

How crawlers work

1. Obtain an initial URL. The initial URL can be specified manually, or determined from one or more seed pages chosen by the user.

2. Crawl the page at the initial URL and obtain new URLs. After obtaining the initial URL, the crawler fetches the page at that address, parses its content, stores the page in the original page database, and adds any new URLs found in the page to a URL queue.

3. Read a new URL from the URL queue, fetch the corresponding page, extract new URLs from it, and repeat the crawling process above.

4. Stop crawling when the stop conditions set for the crawler are met. A crawler is usually written with an explicit stop condition and stops once that condition is satisfied; if no stop condition is set, it keeps crawling until it can obtain no new URLs.
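The four steps above can be sketched as a breadth-first loop. To keep the sketch runnable without network access, an in-memory link graph (a dict mapping URLs to their outgoing links) stands in for real pages; the URLs and page limit are made up for illustration.

```python
from collections import deque

# Hypothetical link graph standing in for real fetched-and-parsed pages.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/b": ["http://example.com/a"],
    "http://example.com/c": [],
}

def crawl(start_url, max_pages=10):
    """Fetch, store, enqueue newly found URLs, repeat until done."""
    queue = deque([start_url])                 # step 1: initial URL
    seen = {start_url}
    stored = []                                # stands in for the page database
    while queue and len(stored) < max_pages:   # step 4: stop condition
        url = queue.popleft()                  # step 3: read a URL from the queue
        links = FAKE_WEB.get(url, [])          # step 2: "fetch" and parse the page
        stored.append(url)                     # store the page
        for link in links:                     # enqueue URLs not seen before
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored

pages = crawl("http://example.com/")
print(pages)
```

The `seen` set is what keeps the loop from revisiting pages when sites link back to each other, and `max_pages` plays the role of the stop condition from step 4.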

A search engine crawls web pages and stores them in its original page database, where the page data is exactly the same HTML a user's browser would receive. While crawling, search engine spiders also perform a certain amount of duplicate-content detection: if they encounter a low-weight site full of plagiarized, scraped, or copied content, they are likely to stop crawling it.

In addition to HTML files, search engines can usually crawl and index a variety of text-based file types, such as PDF, Word, WPS, XLS, PPT, and TXT files; we often see these file types in search results as well. After organizing and processing the information, the search engine provides keyword retrieval services and presents the relevant results to users.

General features of crawlers:

A general-purpose crawler targets a large (often unbounded) number of sites rather than a specific set. It does not try to crawl any site exhaustively, because that would be impractical or impossible; instead, it limits crawl time and page counts. Its logic is simple (compared with focused spiders carrying many extraction rules), and the data is post-processed in a separate stage. It crawls many sites in parallel to avoid being bottlenecked by any single one: each site is crawled slowly, out of politeness, but many sites are crawled simultaneously.

Crawler limitations:

The results returned by a general-purpose search engine are web pages, and in most cases the bulk of their content (often around 90%) is useless to the user.

Users in different fields and with different backgrounds often have different search purposes and requirements, and a general search engine cannot tailor its results to a specific user.

As network data grows richer and network technology develops, large amounts of heterogeneous data such as images, databases, and audio/video multimedia have appeared, which general search engines are poorly equipped to discover and retrieve.

Most general search engines offer keyword-based retrieval, which struggles to support queries based on semantic information and cannot accurately capture a user's specific needs.

Tools for crawlers

Request libraries: Requests, Selenium (which can drive a browser to parse and render CSS and JS, but at a performance cost, since both useful and useless page resources get loaded).

Parsing libraries: re, BeautifulSoup, PyQuery.

Storage: files, MySQL, MongoDB, Redis.
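As a small illustration of the parsing step, the stdlib re module (the simplest of the parsing options listed) can pull links out of an HTML snippet. The snippet and its URLs are made up for the example; regexes are brittle against real-world HTML, so a proper parser like BeautifulSoup or PyQuery is the safer choice for real pages.

```python
import re

# A made-up HTML fragment standing in for a fetched page.
html = """
<html><body>
  <a href="http://example.com/page1">Page 1</a>
  <a href="http://example.com/page2">Page 2</a>
</body></html>
"""

# Grab the value of every double-quoted href attribute.
links = re.findall(r'href="([^"]+)"', html)
print(links)
```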