Update – A case study of the framework in use: just for fun, about 100 lines of code to capture the full-season data of all NBA players


Although I am not a professional crawler engineer, as a Pythoner I have always been fascinated by crawlers.

Python has many crawler frameworks, such as Scrapy and PySpider. I am not someone who particularly likes using other people's wheels, so based on my limited knowledge of crawlers, and borrowing ideas from other frameworks, I built a particularly convenient wheel of my own: PSpider. Building it also deepened my understanding of concepts such as crawler frameworks, multi-threading, and multi-processing.

GitHub address of the PSpider framework: xianhu/PSpider · GitHub. Criticism and likes are both welcome.

From the beginning, the framework was designed to be “simple”: it avoids advanced third-party libraries and keeps the amount of code to a minimum. Therefore, this framework can also serve as a reference for anyone practicing writing a crawler framework.

First, let’s talk about what PSpider has done for me:

  • Captured more than 100,000 startup records from the website of a domestic financing platform
  • Captured all the news data of a domestic science and technology news website, about 70,000 articles
  • Captured all the internal data of a headhunting company, several hundred thousand records (you can guess what the data is; it has not been made public)
  • Captures all the app information of the four major mobile application markets in China, updated daily, about 200,000 records per day
  • Captured data from Sina Weibo, Sogou WeChat public accounts, and other sources, on the order of millions of records
  • Captured the legal judgment documents on China Judgements Online, about 10 million documents
  • Captured the question-bank data of an educational app, including questions, answers, and explanations, about 200,000 items
  • Also captured many small websites and small apps not worth listing here

It has been nearly two years since I started PSpider. For two years it was modified almost every week: adding features, fixing bugs, changing interfaces, and so on, to suit different application scenarios. In the last six months PSpider has gradually taken shape and stabilized, and it has been applied in real projects at my company.

PSpider has very little code, less than 700 lines in total, with quite a few comments and docstrings. Here is an overview of the framework code:
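
(The original post showed a screenshot here. Purely as a hypothetical sketch, inferred only from the three module names discussed below, the layout is something like the following; the actual file and directory names may differ.)

```
PSpider/
├── spider/
│   ├── utilities/    # tool functions and classes (UrlFilter, decorators, ...)
│   ├── instances/    # the Fetcher, Parser, and Saver working classes
│   └── concurrent/   # ThreadPool / ProcessPool and scheduling
└── test.py           # example entry point
```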

Here is a brief description of the structure and function of the framework; anything that is unclear can be checked against the source code, which should be easy to understand. The PSpider framework consists of three modules: Utilities, Instances, and Concurrent.

  1. Utilities module: defines tool functions and classes, abstracting the concrete but general-purpose steps of crawler work into reusable functions and classes, to save crawler engineers’ time. For example, the UrlFilter class filters URLs, and the params_chack decorator checks a function’s input parameters. The module contains roughly 15 functions in all; they are named consistently and commented in detail, so the code itself is easy to browse. (A sketch of such utilities appears after this list.)

  2. Instances module: defines the three working classes, Fetcher, Parser, and Saver, which carry out the actual work of the crawler. If the framework is compared to a factory, the Concurrent module builds the workshops and handles scheduling and information synchronization, the Instances module defines the work flow of the workers in each workshop, and the Utilities module supplies the tools and machines needed for production. When using the framework, you normally inherit and override these three classes, especially the Parser class; all three usually need to be customized, so the framework is not fully generic at this stage. The Fetcher class fetches content for a given URL and returns it; the Parser class parses the fetched content and generates the items to be saved along with a list of URLs still to be fetched; the Saver class saves the items. (See the flow chart above for details.) Note that the working function of each class does not need to be overridden: it already anticipates the problems that can occur during crawling and responds to them correctly. (A subclassing sketch appears after this list.)

  3. Concurrent module: defines a thread pool and a combined process-and-thread pool, which safely and sensibly schedule threads and processes and handle data sharing and synchronization between them. If the parsing work is simple and not CPU-intensive, use ThreadPool, which spawns an appropriate number of fetching, parsing, and saving child threads based on its parameters. If the parsing work is complex and CPU-intensive, multi-threading alone can be inefficient because of Python’s GIL; in that case, consider ProcessPool, which starts multiple fetching threads in the main process and moves the parsing work into separate processes, improving both fetching and parsing efficiency. In addition, this module defines a monitoring thread that continuously reports task status while the crawler works. (A pool-selection sketch appears after this list.)
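
Since the text above only names UrlFilter and params_chack, here is a minimal, hypothetical sketch of what such utilities could look like; the class and decorator names come from the description, but every implementation detail below is my assumption, not PSpider’s actual code.

```python
import re
import functools

def params_chack(*types):
    """Hypothetical parameter-checking decorator: asserts that each
    positional argument matches the declared type before the call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for arg, expected in zip(args, types):
                assert isinstance(arg, expected), \
                    "%s: expected %s, got %s" % (func.__name__, expected, type(arg))
            return func(*args, **kwargs)
        return wrapper
    return decorator

class UrlFilter:
    """Hypothetical URL filter: rejects duplicates and URLs that fail a regex."""
    def __init__(self, allow_pattern=r"^https?://"):
        self._seen = set()
        self._allow = re.compile(allow_pattern)

    @params_chack(object, str)
    def check_and_add(self, url):
        # returns True exactly once per acceptable URL
        if (not self._allow.match(url)) or (url in self._seen):
            return False
        self._seen.add(url)
        return True
```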
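
To make the division of labor concrete, here is a small self-contained sketch of the three-class pattern described in point 2. The class names and the idea that working() wraps an overridable method follow the text; the method names (url_fetch, htm_parse, item_save), signatures, and return conventions are assumptions for illustration, not PSpider’s real API.

```python
import requests  # stand-in HTTP client; the real framework may fetch differently

class Fetcher:
    # working() absorbs run-time problems so subclasses only write url_fetch()
    def working(self, url):
        try:
            return 1, self.url_fetch(url)   # 1: success
        except requests.RequestException:
            return -1, None                 # -1: fetch failed, may be retried

    def url_fetch(self, url):
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return resp.text

class Parser:
    # subclasses override htm_parse() and return (url_list, item_list)
    def working(self, url, content):
        try:
            return 1, self.htm_parse(url, content)
        except Exception:
            return -1, ([], [])

    def htm_parse(self, url, content):
        raise NotImplementedError

class Saver:
    # subclasses override item_save(); this default appends lines to a file
    def working(self, url, item):
        try:
            self.item_save(url, item)
            return 1
        except OSError:
            return -1

    def item_save(self, url, item):
        with open("result.txt", "a", encoding="utf-8") as fp:
            fp.write("%s\t%s\n" % (url, item))
```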
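
PSpider’s pools are custom-built, but the thread-versus-process trade-off described in point 3 can be demonstrated with the standard library alone: threads overlap I/O-bound fetching because the GIL is released while waiting on the network, while CPU-bound parsing needs separate processes for true parallelism. This illustrates the principle only; it is not PSpider’s implementation.

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import urllib.request

URLS = ["https://example.com/"] * 4   # placeholder URL list

def fetch(url):
    # I/O-bound: the GIL is released while waiting on the network,
    # so many threads can overlap their waits
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def parse(content):
    # stand-in for CPU-heavy parsing, which holds the GIL and
    # therefore needs separate processes for true parallelism
    return len(content)

if __name__ == "__main__":
    # simple parsing: a thread pool is enough (cf. ThreadPool)
    with ThreadPoolExecutor(max_workers=10) as fetch_pool:
        pages = list(fetch_pool.map(fetch, URLS))

    # CPU-heavy parsing: move it into processes (cf. ProcessPool)
    with ProcessPoolExecutor(max_workers=4) as parse_pool:
        results = list(parse_pool.map(parse, pages))

    print(results)
```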

Besides these three modules, the framework defines a fairly detailed debug-logging convention and uses various Python tricks (decorators, dynamically generated classes, etc.). So as well as being a Python crawler framework, it is also a good codebase for getting started with Python. For more functions, configuration options, and parameters, read the code.

The main purpose of the framework is to let crawler engineers focus on constructing reasonable requests, parsing web pages, and storing the results, instead of wasting time on writing tool functions, scheduling threads, communicating between processes, making sure threads and processes exit cleanly, and so on. No framework is perfect, and none is completely universal; running smoothly and stably is the hard truth.

The framework isn’t perfect yet, and I keep thinking about it and updating it, almost weekly. If you have functional suggestions or other good architectural ideas, please leave me a comment or open Issues and Pull Requests on GitHub; I will actively consider your suggestions.

The next major plan is to adapt the framework for distributed crawling, to further improve fetching efficiency.

Finally, two log outputs from a test run (shown as screenshots in the original post).


Reprinted article. Author: Laughing Tiger. Source: Zhihu.