
One, the legality of crawlers

At present, the rules about what crawling is and is not allowed are still taking shape.

At least for now, there is generally no problem if the data is captured for personal use. If the data is to be republished, then the type of data matters:

In general, when the captured data is factual data (e.g., business addresses, phone lists), republishing it is allowed; but if it is original data (e.g., opinions or reviews), copyright restrictions usually prevent it from being republished.

Discussion: the legality of the Baidu crawler's data-capture behavior.

Note: even so, as a visitor you should restrict your crawling behavior, which means limiting the rate of download requests to a reasonable value and using a dedicated user agent to identify yourself.
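As a rough illustration of both points, here is a minimal sketch using the requests library; the target URLs, the two-second delay, and the contact address in the User-Agent are all assumptions for the example:

import time
import requests

# A dedicated User-Agent tells the site who is crawling and how to reach you (hypothetical value).
HEADERS = {"User-Agent": "study-crawler/0.1 (contact: you@example.com)"}

# Hypothetical list of pages to fetch; replace with real URLs you are allowed to crawl.
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    print(url, resp.status_code)
    time.sleep(2)  # throttle: pause between requests so the download rate stays reasonable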

Two, crawler preparation: website background research

Background research on the target website is very important for a focused web crawler. As the saying goes: know yourself and know your enemy, and you can win a hundred battles.

1. The robots protocol

The robots protocol (also known as the crawler protocol or robot protocol) is formally the "Robots Exclusion Protocol"; through it, a website tells search engines which pages may be crawled and which may not.

For example:

Taobao: www.taobao.com/robots.txt

Another example:

Douban: www.douban.com/robots.txt
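You can also check a robots.txt file programmatically before crawling. A minimal sketch using Python's standard urllib.robotparser (the page path being tested is only an example):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.douban.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# Ask whether a given user agent is allowed to fetch a given URL
print(rp.can_fetch("*", "https://www.douban.com/subject/1292052/"))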

Search engines cooperate with DNS resolution providers (such as DNSPod), so newly registered domain names can be discovered and crawled quickly. However, a search engine spider's crawling follows certain rules: it must respect directives in files or page markup, such as links marked nofollow or the robots protocol. Alternatively, a webmaster can submit the site's address to the search engine directly, after which the search engine sends its "spiders" to crawl the site.

2. Sitemap

A sitemap is a container for all of a website's links. Many sites have deep link hierarchies that spiders struggle to crawl completely; a sitemap makes it convenient for search engine spiders to crawl the site's pages and, through them, clearly understand the site's structure. The sitemap is usually stored in the root directory and named sitemap; it gives search engine spiders directions and raises the visibility of the site's important pages. A sitemap is a navigation file generated from the site's structure, framework, and content. Most people also know that sitemaps improve the user experience: they give visitors directions and help lost visitors find the pages they are looking for.

Sitemaps come in two forms:

A. HTML: the HTML version of the sitemap is a page users can see on the site, listing links to all of its major pages. A small site can even list every page. As the site grows, a single sitemap page can no longer list every link, so there are two approaches: either the sitemap lists only the most important links (such as first-level and second-level category pages), or the sitemap is split into several files, with a main sitemap listing links to secondary sitemaps and each secondary sitemap listing links to part of the pages.

B. XML: the XML version of the sitemap is composed of XML tags, and the file itself must be UTF-8 encoded. A sitemap file is essentially a list of the URLs the site wants indexed. The simplest sitemap can even be plain text: as long as the file lists the page URLs, one per line, a search engine can grab and understand its content.

You can use an online tool to generate a sitemap for a site, for example: www.sitemap-xml.org
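Once you know where a site's sitemap lives, its <loc> entries can be extracted to seed a crawler. A minimal sketch using only the standard library; the sitemap URL here is hypothetical:

import re
import urllib.request

sitemap_url = "https://example.com/sitemap.xml"  # hypothetical sitemap location
with urllib.request.urlopen(sitemap_url) as resp:
    sitemap = resp.read().decode("utf-8")  # the XML sitemap must be UTF-8 encoded

# Pull every URL out of the <loc> tags
links = re.findall(r"<loc>(.*?)</loc>", sitemap)
print(len(links), "URLs found in the sitemap")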

3. Estimate the size of the website

You can use a search engine to do this, for example by using the site: operator in Baidu:

Note: the website size estimated via the Baidu search engine is only rough; it is limited both by the website's own restrictions on search engine crawlers and by the search engine's own crawling technology. It should therefore be treated only as an empirical value for estimating the site's volume.
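If you want to script this rough estimate, one approach is to request the Baidu results page for a site: query and read the reported result count. This is only a sketch: the result-count wording and page layout are assumptions, and Baidu may change them or block automated queries:

import re
import requests

domain = "example.com"  # site whose size we want to estimate
resp = requests.get(
    "https://www.baidu.com/s",
    params={"wd": "site:" + domain},
    headers={"User-Agent": "Mozilla/5.0"},  # Baidu tends to reject requests without a browser-like UA
    timeout=10,
)

# Assumed marker text on the results page, e.g. "找到相关结果约1,000,000个"
match = re.search(r"找到相关结果约?([\d,]+)个", resp.text)
print(match.group(1) if match else "result count not found")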

4. Identify what technology the site uses

In order to better understand a website before crawling it, we can first get a rough idea of the technical architecture it uses.

Builtwith installation:

Windows: pip install builtwith

Linux:     sudo pip install builtwith

Use: In the Python interactive environment, type:

import builtwith

builtwith.parse("http://www.sina.com.cn")
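If the installation worked, builtwith.parse should return a dictionary mapping technology categories (for example 'web-servers' or 'javascript-frameworks') to the technologies it detected; the exact contents depend on the builtwith version and on what the site currently exposes.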

5. Find the owner of the site

Sometimes we need to know who owns a website, and here is a technically simple way to find out.

Install python-whois:

Windows: pip install python-whois

Use: In the Python interactive environment, type:

import whois

whois.whois("http://www.sina.com.cn")
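The returned record typically includes fields such as the registrar, creation and expiration dates, name servers, and registrant contact details, which is usually enough to identify who operates the site; note that some registries redact parts of this information.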