Using the robotparser module of urllib, we can analyze a website's Robots protocol. In this section, we take a brief look at how to use this module.

1. Robots Protocol

The Robots protocol is also known as the crawler protocol or robots protocol. Its full name is the Robots Exclusion Protocol, and it is used to tell crawlers and search engines which pages can and cannot be crawled. It usually takes the form of a text file called robots.txt placed in the root directory of the site.

When a search crawler visits a site, it will first check whether there is a robots.txt file in the root directory of the site. If there is, the crawler will crawl according to the crawl range defined therein. If the file is not found, the search crawler visits all directly accessible pages.
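
As a minimal illustration of this check, the following sketch builds the robots.txt URL at a site's root and tries to fetch it; the helper name fetch_robots_txt is just an assumed name for illustration.

from urllib.error import HTTPError
from urllib.parse import urlparse, urlunparse
from urllib.request import urlopen

def fetch_robots_txt(page_url):
    # robots.txt always lives at the root of the site
    parts = urlparse(page_url)
    robots_url = urlunparse((parts.scheme, parts.netloc, '/robots.txt', '', '', ''))
    try:
        return urlopen(robots_url).read().decode('utf-8')
    except HTTPError:
        # No robots.txt found: all directly accessible pages may be visited
        return None

print(fetch_robots_txt('http://www.jianshu.com/p/b67554025d7d'))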

Let’s look at an example of robots.txt:

User-agent: *
Disallow: /
Allow: /public/

These rules allow all search crawlers to crawl only the public directory. Save this content as a robots.txt file in the root directory of the site, alongside the site's entry files such as index.php, index.html, and index.jsp.

The User-agent line above specifies the name of the search crawler; setting it to * means the rules apply to every crawler. For example, we could write:

User-agent: Baiduspider

This means the rules we set apply only to Baidu's crawler. If there are multiple User-agent records, multiple crawlers will be subject to the crawl restrictions, but at least one record must be specified.

Disallow specifies the directories that may not be crawled. In the preceding example it is set to /, which means no page may be crawled.

Allow is generally used together with Disallow rather than alone, to carve out exceptions from a broader restriction. Here we set it to /public/, meaning that although crawling all pages is disallowed, the public directory may still be crawled.

Let’s look at some more examples. The code forbidding all crawlers from accessing any directory is as follows:

User-agent: * 
Disallow: /

The code that allows all crawlers to access any directory is as follows:

User-agent: *
Disallow:

It is also possible to leave the robots.txt file blank.

The code that prevents all crawlers from accessing certain directories on the site is as follows:

User-agent: *
Disallow: /private/
Disallow: /tmp/

The rules that allow only one crawler (here named WebCrawler) to access the site are as follows:

User-agent: WebCrawler
Disallow:
User-agent: *
Disallow: /

These are some common ways to write robots.txt.
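
As a quick preview of the robotparser module introduced below, the following minimal sketch feeds the "certain directories" rules above into RobotFileParser.parse() and checks a few URLs; example.com is only an assumed illustrative domain.

from urllib.robotparser import RobotFileParser

rules = [
    'User-agent: *',
    'Disallow: /private/',
    'Disallow: /tmp/',
]

rp = RobotFileParser()
rp.parse(rules)  # parse the rules given as a list of lines
print(rp.can_fetch('*', 'http://example.com/index.html'))      # True
print(rp.can_fetch('*', 'http://example.com/private/a.html'))  # False
print(rp.can_fetch('*', 'http://example.com/tmp/b.txt'))       # False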

2. Crawler name

You may wonder where these crawler names come from and why they are written this way. In fact, each crawler has a fixed name; Baidu's, for example, is called BaiduSpider. Table 3-1 lists the names of some common crawlers and their corresponding websites.

Table 3-1 Names of common crawlers and their corresponding websites

3. robotparser

Now that we understand the Robots protocol, we can use the robotparser module to parse robots.txt. The module provides a class, RobotFileParser, which determines whether a crawler has permission to crawl a given web page based on a site's robots.txt file.

This class is very simple to use: you only need to pass the robots.txt link to its constructor. First, take a look at its declaration:

urllib.robotparser.RobotFileParser(url='')

Of course, you can also leave the URL out when creating the object and set it later with the set_url() method.

The following lists several methods that are commonly used by this class.

  • set_url(): Sets the link to the robots.txt file. If you already passed the link when creating the RobotFileParser object, you do not need to call this method.
  • read(): Reads the robots.txt file and analyzes it. Note that this method performs the read and parse operations; if it is not called, all subsequent judgments will be False, so remember to call it. It does not return anything, but the read is performed.
  • parse(): Parses the robots.txt file. The argument passed in is the content of robots.txt as individual lines, which it analyzes according to the robots.txt syntax rules.
  • can_fetch(): Takes two arguments: the first is the User-agent and the second is the URL to be crawled. It returns True or False, indicating whether that crawler is allowed to crawl the URL.
  • mtime(): Returns the time when robots.txt was last fetched and analyzed. This is useful for long-running crawlers, which may need to check regularly so they work with the latest robots.txt.
  • modified(): Also useful for long-running crawlers; it sets the time of the last fetch and analysis of robots.txt to the current time. A short sketch exercising mtime() and modified() follows the examples below.

Here’s an example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))

In this example, we first create a RobotFileParser object and set the robots.txt link through the set_url() method. Of course, instead of using this method, we could pass the link directly when creating the object:

rp = RobotFileParser('http://www.jianshu.com/robots.txt')

Then the can_fetch() method is used to determine whether the web page can be fetched.

The running results are as follows:

True
False

The parse() method can also be used to feed in and parse the content, as shown in the following example:

from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

rp = RobotFileParser()
rp.parse(urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))

The result is the same:

True
False
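
For completeness, here is a minimal sketch exercising mtime() and modified(), the two methods listed earlier that the examples above do not show; it reuses the same robots.txt link.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()          # fetch and analyze robots.txt
print(rp.mtime())  # time of the last fetch and analysis
rp.modified()      # record the current time as the last fetch/analysis time
print(rp.mtime())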

This section introduced the basic usage of the robotparser module through examples. With it, we can easily determine which pages can be crawled and which cannot.


This resource first appeared on Cui Qingcai's personal blog: Python3 Web Crawler Development Practical Tutorial.

For more crawler information, please follow my personal WeChat official account: Attack Coder
