Background

A web crawler is a handy tool that automatically collects information from the Web, saving a great deal of manual work. Before good crawler frameworks appeared, crawlers were built from a simple HTTP client plus an HTML parser, such as Python's Requests + BeautifulSoup, and more advanced crawlers added a data storage layer such as MySQL or MongoDB. Built this way, crawlers are slow to develop and not very stable; a complete, production-ready crawler can take several hours to write. I call this approach the non-framework crawler.

In 2011, Scrapy, a crawler framework built on Twisted, came out of nowhere and quickly became known as the leading high-performance asynchronous crawler framework. Scrapy abstracts away several core modules, letting developers focus on crawling logic rather than downloading, page parsing, and task scheduling. Developing a production-ready crawler can take as little as ten minutes for a simple site, or an hour or more for a complex one. There are of course many other excellent frameworks, such as PySpider and Colly. I call this type the framework crawler. Framework crawlers freed up productivity, and many companies now adapt them to capture data at scale in production environments.

However, when you need to crawl hundreds or even thousands of websites, a framework crawler starts to fall short, and writing crawlers turns into manual labor. For example, suppose developing one framework crawler takes 20 minutes on average and a full-time crawler developer works 8 hours a day. Covering 1,000 websites would then take 20,000 minutes, or roughly 333 hours, about 42 working days, nearly 2 months. We could of course hire 10 full-time crawler developers, but it would still take them about 4 working days to finish (see the picture below).

This, too, is inefficient. The configurable crawler came into being to overcome exactly this efficiency problem.

Introduction to configurable crawlers

The configurable crawler, as its name implies, is a crawler whose scraping rules can be configured. A configurable crawler is a highly abstracted crawler: the developer does not write crawler code, but instead records the URLs, fields, and attributes of the pages to be scraped in a configuration file or database, and a dedicated crawler engine then collects data according to that configuration. In other words, the configurable crawler abstracts crawler code into configuration information, which simplifies crawler development; the developer only needs to fill in the corresponding configuration to finish a crawler. This lets developers produce crawlers at scale (see figure below).

This approach makes crawling hundreds of websites feasible; a skilled configurator can configure crawlers for 1,000 news websites in a day. This matters a lot to enterprises that need public opinion monitoring: configurable crawlers raise productivity, lower the unit cost of working time, improve development efficiency, and pave the way for subsequent public opinion analysis and AI product development. Many companies build their own configurable crawlers in-house (the name may differ, but it is essentially the same thing) and then hire crawler configurators to configure them.

There are not many free, open-source configurable crawler frameworks on the market. One of the earliest is Gerapy, a crawler management platform that generates Scrapy project files from configuration rules. A relatively new one is Crawlab (strictly speaking, Crawlab is not a configurable crawler framework but a highly flexible crawler management platform), which released its configurable crawler feature in v0.4.0. There is also an interesting open-source framework written in Golang called Ferret, which makes writing crawlers as easy as writing SQL. There are some commercial products as well, but according to user feedback they do not feel mature enough to meet production needs.

Configurable crawlers are possible mainly because the structure of most crawled sites is simple: nothing more than a list page plus detail pages (as shown below), or just a list page. Slightly more complex, more general sites can also be handled through rule configuration.
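To make this concrete, here is a minimal sketch of a list-plus-detail configuration written in the Spiderfile format used by the examples later in this article. The start URL and selectors are placeholders for an imaginary site, and the empty optional keys shown in the full examples are omitted here for brevity.

version: 0.4.4
engine: scrapy
start_url: https://example.com/articles  # placeholder list page URL
start_stage: list
stages:
- name: list
  is_list: true                  # this stage parses a list page
  list_css: .article-item        # selector matching each item in the list
  page_css: .pagination .next    # selector for the "next page" link
  page_attr: href                # attribute holding the next-page URL
  fields:
  - name: title
    css: .article-title          # with no attr, the element's text is extracted
  - name: url
    css: .article-title
    attr: href                   # extract an attribute instead of text
    next_stage: detail           # follow this URL into the detail stage below
- name: detail
  is_list: false                 # this stage parses a single detail page
  fields:
  - name: content
    css: .article-content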

Crawlab's configurable crawler

Today we focus on Crawlab's configurable crawler. We introduced it in an earlier article, but did not go into detail about how to apply it to real-world sites; that is what this article is about. If you are not familiar with Crawlab's configurable crawler, please refer to the configurable crawler documentation.

Configurable crawlers in practice

All of the examples in this hands-on section were configured and run by the author on Crawlab's official demo platform using the configurable crawler feature, covering news, finance, cars, books, video, search engines, programmer communities, and other fields (see the picture below). A few of them are shown below. All of the examples are available on the official demo platform; you can register an account, log in, and view them.

Baidu (search for “Crawlab”)

Crawler address: crawlab.cn/demo#/spide…

Crawler configuration

Spiderfile

version: 0.4.4
engine: scrapy
start_url: http://www.baidu.com/s?wd=crawlab
start_stage: list
stages:
- name: list
  is_list: true
  list_css: ""
  list_xpath: //*[contains(@class, "c-container")]
  page_css: ""
  page_xpath: //*[@id="page"]//a[@class="n"][last()]
  page_attr: href
  fields:
  - name: title
    css: ""
    xpath: .//h3/a
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: ""
    xpath: .//h3/a
    attr: href
    next_stage: ""
    remark: ""
  - name: abstract
    css: ""
    xpath: .//*[@class="c-abstract"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36

Crawl results

SegmentFault (latest articles)

Crawler address: crawlab.cn/demo#/spide…

Crawler configuration

Spiderfile

version: 0.4.4
engine: scrapy
start_url: https://segmentfault.com/newest
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .news-list > .news-item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: title
    css: h4.news__item-title
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: .news-img
    xpath: ""
    attr: href
    next_stage: ""
    remark: ""
  - name: abstract
    css: .article-excerpt
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36

Crawl results

Amazon China (search for "mobile phone")

Crawler address: crawlab.cn/demo#/spide…

Crawler configuration

Spiderfile

version: 0.4.4
engine: scrapy
start_url: https://www.amazon.cn/s?k=%E6%89%8B%E6%9C%BA&__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&ref=nb_sb_noss_2
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .s-result-item
  list_xpath: ""
  page_css: .a-last > a
  page_xpath: ""
  page_attr: href
  fields:
  - name: title
    css: span.a-text-normal
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: .a-link-normal
    xpath: ""
    attr: href
    next_stage: ""
    remark: ""
  - name: price
    css: ""
    xpath: .//*[@class="a-price-whole"]
    attr: ""
    next_stage: ""
    remark: ""
  - name: price_fraction
    css: ""
    xpath: .//*[@class="a-price-fraction"]
    attr: ""
    next_stage: ""
    remark: ""
  - name: img
    css: .s-image-square-aspect > img
    xpath: ""
    attr: src
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36

Crawl results

V2ex

Crawler address: crawlab.cn/demo#/spide…

Crawler configuration

Spiderfile

version: 0.4.4
engine: scrapy
start_url: https://v2ex.com/
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .cell.item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: href
  fields:
  - name: title
    css: a.topic-link
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: a.topic-link
    xpath: ""
    attr: href
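    # next_stage sends the extracted href to the "detail" stage defined below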
    next_stage: detail
    remark: ""
  - name: replies
    css: .count_livid
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
- name: detail
  is_list: false
  list_css: ""
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: content
    css: ""
    xpath: .//*[@class="markdown_body"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  AUTOTHROTTLE_ENABLED: "true"
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36

Crawl results

36Kr

Crawler address: crawlab.cn/demo#/spide…

Crawler configuration

Spiderfile

version: 0.4.4
engine: scrapy
start_url: https://36kr.com/information/web_news
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .kr-flow-article-item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: title
    css: .article-item-title
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: body
    xpath: ""
    attr: href
    next_stage: detail
    remark: ""
  - name: abstract
    css: body
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: author
    css: .kr-flow-bar-author
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: time
    css: .kr-flow-bar-time
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
- name: detail
  is_list: false
  list_css: ""
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: content
    css: ""
    xpath: .//*[@class="common-width content articleDetailContent kr-rich-text-wrapper"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36

Crawl results

Example crawlers at a glance

Crawler name       Crawler type
Baidu              List page + pagination
SegmentFault       List page only
CSDN               List page + pagination + detail page
V2ex               List page + detail page
Zongheng           List page only
Amazon China       List page + pagination
Xueqiu             List page + detail page
Autohome           List page + pagination
Douban Reading     List page only
36Kr               List page + detail page
Tencent Video      List page only

Conclusion

Crawlab's configurable crawler is very convenient and lets developers quickly configure the crawlers they need. It took the author less than 40 minutes to configure the 11 crawlers listed above (including time spent debugging around anti-crawling measures), and several of the simpler ones took only one or two minutes each. The author did not write a single line of code: all of the configuration was done in the interface. In addition, Crawlab's configurable crawlers support not only configuration in the interface but also configuration by writing a YAML Spiderfile (in fact, all of the configuration can be mapped to a Spiderfile). Crawlab's configurable crawlers are based on Scrapy, so they support most of Scrapy's features, and you can configure extended properties such as USER_AGENT and ROBOTSTXT_OBEY.

Why choose Crawlab for configurable crawlers? Because with Crawlab you not only get configurable crawlers, you also get the core features of the Crawlab crawler management platform, including task scheduling, task monitoring, scheduled tasks, log management, message notifications, and other practical functions. In future development, the Crawlab team will continue to improve the configurable crawler so that it supports more features, including dynamic content, more engines, a CrawlSpider implementation, and so on.

Note that failing to comply with robots.txt may carry legal risks. The example crawlers in this article are for learning and exchange only and should not be used in production environments. Anyone who abuses them bears the resulting legal liability themselves.

References

  • GitHub: github.com/crawlab-tea…
  • Demo: crawlab.cn/demo
  • Documentation: docs.crawlab.cn

If you find Crawlab helpful for your daily development or your company, please add the author on WeChat (tikazyQ1) with the note "Crawlab", and the author will add you to the discussion group. You are welcome to star the project on GitHub, and feel free to open an issue if you run into any problems. Contributions to Crawlab's development are also very welcome.