Depending on the application scenario, web crawlers can be divided into general crawlers and focused crawlers.

General crawler

A general web crawler is an important component of a search engine's retrieval system (Baidu, Google, Yahoo, etc.). Its main purpose is to download web pages from the Internet to local storage, forming a mirror backup of Internet content.

How a general search engine works

A general web crawler gathers web pages and information from the Internet to build the index that supports the search engine. It determines how rich the content of the whole engine is and how up to date the information is, so its performance directly affects the quality of the search engine.

Step 1: Crawl web pages

The basic workflow of a search-engine web crawler is as follows (a minimal code sketch appears after the list):

  1. First, select some seed URLs and put them into the queue of URLs to be crawled.
  2. Take a URL out of the queue, resolve DNS to get the host's IP, download the web page the URL points to, store it in the library of downloaded pages, and move the URL into the queue of crawled URLs.
  3. Analyze the pages corresponding to the crawled URLs, extract the other URLs they contain, and put those into the queue of URLs to be crawled, thus entering the next loop…
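Below is a minimal Python sketch of that loop, assuming the `requests` and `beautifulsoup4` packages; the seed URL, queue handling, and in-memory page store are illustrative simplifications, not how a production search-engine crawler is built.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

seed_urls = ["https://example.com/"]      # hypothetical seed URLs
to_crawl = deque(seed_urls)               # queue of URLs to be crawled
crawled = set()                           # URLs that have already been crawled
page_store = {}                           # the "downloaded page library"
MAX_PAGES = 50                            # keep the sketch from running forever

while to_crawl and len(crawled) < MAX_PAGES:
    url = to_crawl.popleft()              # step 2: take a URL out of the queue
    if url in crawled:
        continue
    try:
        html = requests.get(url, timeout=5).text   # resolve DNS, download the page
    except requests.RequestException:
        continue
    page_store[url] = html                # store it in the page library
    crawled.add(url)
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):        # step 3: extract new URLs
        new_url = urljoin(url, a["href"])
        if new_url.startswith("http") and new_url not in crawled:
            to_crawl.append(new_url)      # queue them for the next loop
```

A real crawler would also add politeness delays, robots.txt checks, and persistent storage.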

How do search engines get the URLs of new websites?

  1. The new website actively submits its URL to the search engine (for example, Baidu's submission page: zhanzhang.baidu.com/linksubmit/…

  2. Set up external links to the new site on other websites (ideally on sites already within the reach of the search engine's crawlers).

  3. The search engine cooperates with DNS resolution providers (such as DNSPod), so newly registered domain names can be crawled quickly.

However, search-engine spiders crawl according to certain rules: they must comply with directives such as links marked with nofollow and the site's Robots protocol.

> The Robots Protocol (also known as the crawler protocol or robot protocol), formally the "Robots Exclusion Protocol", lets a website tell search engines which pages may be crawled and which may not. See, for example, Taobao's https://www.taobao.com/robots.txt and Tencent's http://www.qq.com/robots.txt
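As a small illustration, Python's standard-library `urllib.robotparser` can read a robots.txt file and answer whether a given user agent may fetch a URL; the user-agent string and the path checked below are just placeholders.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.taobao.com/robots.txt")  # the robots.txt mentioned above
rp.read()                                        # download and parse the file

# "MyCrawler" and the checked path are hypothetical examples
print(rp.can_fetch("MyCrawler", "https://www.taobao.com/article/"))
```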

Step 2: Data storage

The search engine stores the pages it crawls in a raw-page database. The stored page data is exactly the same HTML that a user's browser would receive.

While crawling pages, search-engine spiders also perform some duplicate-content detection: if they encounter a large amount of plagiarized, scraped, or copied content on a site with very low weight, they are likely to stop crawling it.
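As a very rough illustration of duplicate-content detection, the sketch below only catches byte-for-byte identical pages by hashing them; real engines use near-duplicate techniques such as shingling or SimHash.

```python
import hashlib

seen_hashes = set()   # hashes of page bodies stored so far

def is_duplicate(page_html: str) -> bool:
    """Return True if an identical page body has already been seen."""
    digest = hashlib.md5(page_html.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```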

Step 3: Preprocessing

The search engine preprocesses the crawled pages in several steps (a rough code sketch follows this list):

  • Extracting text
  • Chinese word segmentation
  • Eliminate noise (such as copyright text, navigation bars, advertisements, etc.)
  • Index processing
  • Link relation calculation
  • Special file handling
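The sketch below strings a few of these steps together: extracting text, stripping some noise tags, segmenting Chinese text, and filling a toy inverted index. It assumes the `beautifulsoup4` and `jieba` packages and is far simpler than a real indexing pipeline.

```python
from collections import defaultdict

from bs4 import BeautifulSoup
import jieba

inverted_index = defaultdict(set)   # term -> set of URLs containing it

def preprocess(url: str, html: str) -> None:
    soup = BeautifulSoup(html, "html.parser")
    # Eliminate "noise" such as scripts, styles, and navigation bars
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator=" ")        # extract the visible text
    for term in jieba.cut_for_search(text):    # Chinese word segmentation
        term = term.strip()
        if term:
            inverted_index[term].add(url)      # index processing
```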

In addition to HTML files, search engines can usually crawl and index a variety of text-based file types, such as PDF, Word, WPS, XLS, PPT, TXT files, etc. We often see these file types in search results as well.

But search engines cannot yet handle non-text content such as images, videos, and Flash, nor can they execute scripts or programs.

Step 4: Provide search services and site ranking

After organizing and processing the information, the search engine provides keyword-retrieval services and presents the relevant results to users.

At the same time, sites are ranked according to their PageRank value (a link-analysis ranking), so sites with a high rank appear earlier in the search results. Of course, a site can also simply pay the search engine to rank it higher, which is crude but effective.
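For intuition, here is a toy PageRank iteration over a small hypothetical link graph; real engines compute this at web scale and blend it with many other ranking signals.

```python
def pagerank(links, damping=0.85, iterations=20):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}          # start uniform
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue                                  # toy model: skip dangling pages
            share = damping * rank[page] / len(outgoing)  # spread rank over out-links
            for target in outgoing:
                if target in new_rank:
                    new_rank[target] += share
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical link graph
print(pagerank(graph))
```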

However, these general-purpose search engines also have certain limitations:

  • The results returned by a general search engine are web pages, and in most cases 90% of the content is useless to the user.
  • Users in different fields and with different backgrounds often have different search purposes and requirements, and the search engine cannot provide results tailored to a specific user.
  • As web data formats and network technology evolve, images, databases, audio, video, and other kinds of data appear in large quantities. A general search engine is largely powerless with these files and cannot discover and retrieve them well.
  • Most general search engines offer keyword-based retrieval, which struggles to support queries based on semantic information and cannot accurately understand users' specific needs.

In view of these limitations, focused crawler technology is widely used.

Focused crawler

A focused crawler is a web crawler "oriented toward specific subject requirements". The difference from a general search-engine crawler is that a focused crawler processes and filters content while crawling, trying to ensure that only pages relevant to the requirements are captured (see the sketch below). The crawlers we will learn to build later are focused crawlers.
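To make the "filter while crawling" idea concrete, here is a tiny sketch that keeps only pages whose text mentions some topic keywords; the keyword set and the relevance test are deliberately naive assumptions.

```python
TOPIC_KEYWORDS = {"python", "crawler", "scraping"}   # hypothetical topic of interest

def is_relevant(page_text: str) -> bool:
    """Naive relevance check: does the page mention any topic keyword?"""
    text = page_text.lower()
    return any(keyword in text for keyword in TOPIC_KEYWORDS)

def keep_page(url: str, page_text: str, storage: dict) -> None:
    if is_relevant(page_text):        # filter during the crawl
        storage[url] = page_text      # only topic-related pages are stored
```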

I'm Bai-You-Bai i, a programmer who loves sharing knowledge ❤️

If you can't program yet or want to learn, feel free to leave a message on this blog. Thank you very much for your likes, favorites, comments, and support.