This article rounds up some of the more efficient Python crawler frameworks and what each one is good for.

1.Scrapy

Scrapy is an application framework designed to crawl websites and extract structured data. It can be used in a wide range of applications, including data mining, information processing, and historical archiving. The framework makes it easy to scrape data such as Amazon product information.

Project address: scrapy.org/
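
To give a feel for the framework, here is a minimal sketch of a Scrapy spider. The target site (quotes.toscrape.com, Scrapy's official demo site) and the CSS selectors are assumptions chosen for illustration, not something from the original article.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Illustrative spider; the site and selectors are assumptions.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one structured item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next page" link and parse it with this same method.
        yield from response.follow_all(response.css("li.next a"), callback=self.parse)
```

Saved as quotes_spider.py, this can be run without creating a full project via `scrapy runspider quotes_spider.py -o quotes.json`.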

2.PySpider

PySpider is a powerful web crawler system written in Python. It provides a browser-based interface for writing scripts, scheduling jobs, and monitoring crawl results in real time; results can be stored in common databases on the back end, and tasks can be scheduled periodically with per-task priorities.

Project address: github.com/binux/pyspi…
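
As an illustration of the scripting model, the sketch below follows the shape of pyspider's default handler template; the seed URL is a placeholder assumption.

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)  # re-run the entry point once a day
    def on_start(self):
        # The seed URL is a placeholder assumption.
        self.crawl("https://example.com/", callback=self.index_page)

    def index_page(self, response):
        # Queue every outgoing link on the page for detail crawling.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)  # detail pages are fetched before new index pages
    def detail_page(self, response):
        # Returned dicts are saved to pyspider's result database.
        return {"url": response.url, "title": response.doc("title").text()}
```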

3.Crawley

Crawley crawls website content at high speed, supports both relational and non-relational databases, and can export the scraped data as JSON or XML.

Project address: project.crawley-cloud.com/

4.Portia

Portia is an open source visual crawler that allows you to crawl websites without any programming knowledge! Simply annotate the pages you are interested in, and Portia will create a spider to extract data from similar pages.

Project address: github.com/scrapinghub…

5.Newspaper

Newspaper can extract and parse news articles and perform content analysis. It uses multithreading and supports more than 10 languages.

Project address: github.com/codelucas/n…
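
A typical use of the library (the newspaper3k package on Python 3) looks roughly like the sketch below; the article URL is a placeholder assumption, and the nlp() step additionally requires NLTK's tokenizer data.

```python
from newspaper import Article

# Placeholder URL; any news article page works.
url = "https://example.com/some-news-article"

article = Article(url, language="en")
article.download()  # fetch the raw HTML
article.parse()     # extract title, authors, body text, images, ...

print(article.title)
print(article.authors)
print(article.publish_date)

article.nlp()       # optional: keyword and summary extraction
print(article.keywords)
print(article.summary)
```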

6.Beautiful Soup

Beautiful Soup is a Python library that extracts data from HTML or XML files. It works with your favorite parser and lets you navigate, search, and modify the parse tree. Beautiful Soup can save you hours or even days of work.

Project address: www.crummy.com/software/Be…
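
Here is a short, self-contained sketch of navigating, searching, and modifying a document with Beautiful Soup (the bs4 package); the inline HTML is made up for the example.

```python
from bs4 import BeautifulSoup

# Tiny inline document so the example is self-contained.
html = """
<html><body>
  <p class="title"><b>Sample page</b></p>
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # any installed parser works

print(soup.p.b.string)        # navigate the tree: "Sample page"
for a in soup.find_all("a"):  # search for all matching tags
    print(a["href"], a.get_text())

soup.p.b.string = "Renamed"   # modify the document in place
print(soup.p)
```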

7.Grab

Grab is a Python framework for building web scrapers. With Grab, you can build a variety of web scraping tools, from simple 5-line scripts to complex asynchronous scrapers that handle millions of pages. Grab provides an API for performing network requests and processing the received content, for example interacting with the DOM tree of an HTML document.

Project address: docs.grablib.org/en/latest/#…
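
A request plus a DOM query in Grab looks roughly like the sketch below; the URL is a placeholder assumption.

```python
from grab import Grab

g = Grab()
# Placeholder URL; g.go() performs the network request.
resp = g.go("https://example.com/")

print(resp.code)                       # HTTP status of the response
# Query the parsed DOM tree of the HTML document with XPath.
print(g.doc.select("//title").text())
```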

8.Cola

Cola is a distributed crawler framework that lets users write only a few application-specific functions without worrying about the details of distributed execution. Tasks are automatically distributed across multiple machines, and the whole process is transparent to the user.

Project address: github.com/chineking/c…