Abstract: This article introduces the knowledge and skills needed to develop web crawlers with Python from five aspects: crawling, parsing, storage, anti-crawling, and acceleration, and explains how to take different measures to grab data efficiently in different scenarios.

Some time ago I took part in a live-streamed sharing session on the topic of Python web crawlers, where I mainly summarized my experience since I started working on web crawlers. The session was divided into three stages: in the first I talked about my path through programming since university; in the second I gave the formal web crawler talk and went through the key points of crawler development in detail; in the third I answered questions and gave away a few gifts by lottery. Here I will summarize the main content of that sharing, and I hope you find it useful!

Overview

The theme of the talk was “Robust and Efficient Web Crawlers”. It introduces the knowledge and skills of developing web crawlers with Python from five aspects: crawling, parsing, storage, anti-crawling, and acceleration, and explains how to take different measures to grab data efficiently in different scenarios. It covers web crawlers, App crawlers, data storage, proxy selection, CAPTCHA recognition, distributed crawling and management, intelligent parsing, and more, and also introduces some commonly used toolkits for different scenarios. All of this is the distilled essence of my experience researching web crawlers.

Crawling

For crawling, we need to learn to use different methods to handle different situations.

Most of the time, the target of the crawl is either a web page or an App, so these two broad categories are covered here.

For web pages, I divide them into two categories: server-side rendering and client-side rendering. For Apps, I divide the interfaces into four types: ordinary interfaces, encrypted-parameter interfaces, encrypted-content interfaces, and unconventional-protocol interfaces.

So the overall outline looks like this:

  • Web crawling
    • Server-side rendering
    • Client-side rendering


  • App crawling
    • Ordinary interface
    • Encrypted-parameter interface
    • Encrypted-content interface
    • Unconventional-protocol interface

Crawling/Web pages

Server-side rendering means that the page is rendered by the server and returned complete, so the useful information is already contained in the requested HTML page, as on the Maoyan Movies site. Client-side rendering means that the main content of the page is rendered by JavaScript and the real data is obtained through Ajax interfaces, as on Taobao, the Weibo mobile site, and similar sites.

Server-side rendering is relatively simple to handle: crawling can be done with basic HTTP request libraries such as urllib, urllib3, pycurl, hyper, requests, and grab, of which requests is probably the most widely used.
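As a minimal sketch of this case, the snippet below fetches a server-rendered page with requests; the URL and headers are placeholders, not a specific site's real interface.

```python
import requests

# Fetch a server-rendered page: the useful data is already in the HTML.
url = 'https://example.com/films'            # placeholder URL
headers = {'User-Agent': 'Mozilla/5.0'}      # look like a normal browser

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()                  # fail fast on 4xx/5xx
html = response.text
print(html[:200])                            # pass this HTML on to the parsing stage
```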

For client-side rendering, I divide the handling into four methods:

  • Ajax interface analysis: in this case you can use the Chrome/Firefox developer tools to inspect the specific Ajax request method, parameters, and so on, and then simulate the request with an HTTP request library (see the sketch after this list); you can also capture packets through a proxy such as Fiddler or Charles to inspect the interface.
  • Simulated browser execution: suitable when the web interface and logic are complex; you can crawl in a what-you-see-is-what-you-crawl way using Selenium, Splinter, Spynner, Pyppeteer, PhantomJS, Splash, requests-html, etc.
  • Direct extraction of JavaScript data: in some cases the real data is not fetched through an Ajax interface but is embedded directly in a variable of the HTML result, and can be extracted with regular expressions.
  • Simulated JavaScript execution: in some cases directly driving a browser is inefficient; if we can figure out the site's JavaScript execution and encryption logic, we can execute the relevant JavaScript directly to complete the logic processing and interface requests, for example with Selenium, PyExecJS, PyV8, Js2Py, and other libraries.
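Below is a minimal sketch of the first method: calling an Ajax interface found in the developer tools directly with requests. The endpoint, parameters, and the 'items' field are hypothetical placeholders.

```python
import requests

# Call the Ajax interface directly instead of rendering the page.
api = 'https://example.com/api/list'          # placeholder endpoint
params = {'page': 1, 'size': 20}              # placeholder query parameters
headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',     # some sites check this header
}

resp = requests.get(api, params=params, headers=headers, timeout=10)
data = resp.json()                            # client-rendered sites usually return JSON
for item in data.get('items', []):            # 'items' is an assumed field name
    print(item)
```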

Crawling/Apps

For App crawling, there are four cases to handle:

  • For ordinary unencrypted interfaces, simply capture packets to learn the exact request format of the interface and then simulate it. Usable packet capture tools include Charles, Fiddler, and mitmproxy.
  • For interfaces with encrypted parameters, one approach is to process the captured traffic in real time, for example with Fiddler, mitmdump, or Xposed (see the mitmdump sketch after this list); the other is to analyze the encryption logic and construct the parameters directly, which may require some decompilation skills.
  • For interfaces with encrypted content, where the returned result is completely unreadable, you can use the visible-crawling tool Appium, use an Xposed hook to obtain the rendered result, or decompile and modify the underlying code on the phone to implement the parsing.
  • For unconventional protocols, you can use Wireshark to capture packets of all protocols or tcpdump to capture TCP packets.
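As a minimal sketch of real-time processing with mitmdump, the script below saves the responses of an App's API while the phone's traffic is proxied through this machine. The host filter and output file name are assumptions.

```python
# Save as dump.py and run: mitmdump -s dump.py
# Point the phone's proxy at this machine so its traffic flows through mitmproxy.
import json

def response(flow):
    # Only keep responses coming from the App's API host (placeholder host).
    if 'api.example.com' in flow.request.pretty_url:
        record = {
            'url': flow.request.pretty_url,
            'body': flow.response.get_text(),
        }
        with open('capture.jsonl', 'a', encoding='utf-8') as f:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')
```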

This covers the classification of crawling scenarios and how to handle each of them.

Parsing

For parsing HTML pages, the common methods boil down to a few: regular expressions, XPath, and CSS selectors. For interfaces, the responses are typically JSON or XML, which can be handled with the corresponding libraries.
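Here is a minimal sketch of all three HTML methods using the parsel library (not mentioned above, but it conveniently supports both XPath and CSS selectors); the HTML snippet is made up for illustration.

```python
import re
from parsel import Selector

html = '<html><body><h1>Title</h1><a href="/detail?id=123">more</a></body></html>'
sel = Selector(text=html)

print(sel.css('h1::text').get())               # CSS selector
print(sel.xpath('//a/@href').get())            # XPath
print(re.search(r'id=(\d+)', html).group(1))   # regular expression
```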

Writing these rules and parsing routines by hand is tedious. If we have to crawl tens of thousands of websites and write dedicated rules for each one, it becomes exhausting, so intelligent parsing is a real need.

Intelligent parsing means that, given a page, an algorithm can automatically extract its title, body, date, and other content while stripping out the useless information, much like the automatic extraction performed by Safari's built-in Reader mode.

For intelligent parsing, I divide the approaches into the following four:

  • The Readability algorithm, which defines scoring rules for different sets of tags and computes weights to find the most likely body blocks (a sketch using the readability-lxml package follows this list).
  • Text density analysis: compute the average text length per node within a block and roughly separate body content from noise by its density.
  • Scrapely self-learning: Scrapely is a component developed by the Scrapy team; given a sample page and its extraction results, it can learn extraction rules and extract other pages with a similar structure.
  • Deep learning: perform supervised learning on labeled parse locations (for example, line by line), which requires a large amount of annotated data.
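As a minimal sketch of the Readability approach, the snippet below uses the readability-lxml package (install with pip install readability-lxml); the URL is a placeholder.

```python
import requests
from readability import Document

# Let the Readability algorithm guess the title and main body of an article page.
html = requests.get('https://example.com/article', timeout=10).text  # placeholder URL
doc = Document(html)
print(doc.title())    # extracted title
print(doc.summary())  # HTML fragment of the most likely body block
```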

If you can tolerate a certain error rate, you can use intelligent parsing to save a lot of time.

At present, I am still exploring this part, and the accuracy rate needs to be improved.

Storage

Storage means selecting an appropriate medium to store the crawled results; here I introduce four storage methods.

  • Files, such as JSON, CSV, TXT, images, video, and audio; commonly used libraries include csv, xlwt, json, pandas, pickle, python-docx, etc.
  • Databases, divided into relational and non-relational databases, such as MySQL, MongoDB, and HBase; common libraries include pymysql, pymssql, redis-py, pymongo, py2neo, and thrift (a MongoDB sketch follows this list).
  • Search engines, such as Solr and Elasticsearch, which make searching and text matching easy; common libraries include elasticsearch and pysolr.
  • Cloud storage: some media files can be stored in Qiniu Cloud, Upyun, Alibaba Cloud, Tencent Cloud, Amazon S3, etc.; common libraries include qiniu, upyun, boto, azure-storage, google-cloud-storage, etc.
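Below is a minimal MongoDB sketch with pymongo; the connection settings, database, collection, and field names are examples rather than a prescribed schema.

```python
from pymongo import MongoClient

# Connect to a local MongoDB and upsert one crawled item.
client = MongoClient('localhost', 27017)
collection = client['spider']['movies']            # example database/collection names

item = {'title': 'Some Movie', 'score': 9.0}       # example crawled fields
collection.update_one({'title': item['title']},    # upsert keeps re-crawls idempotent
                      {'$set': item}, upsert=True)
```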

The key to this part is to align with the actual business and see which method best meets its needs.

Anti-crawling

Anti-crawling is a key topic. Crawling is getting harder and harder, and many websites have added all kinds of anti-crawler measures, which can be grouped into non-browser detection, IP blocking, CAPTCHAs, account blocking, font-based obfuscation, and so on.

The following explains how to handle anti-crawling from three aspects: IP blocking, CAPTCHAs, and account blocking.

Anti-crawling/IP blocking

For blocked IPs, there are several ways to handle the situation:

  • First look for the mobile or App version of the site; if one exists, its anti-crawling measures are usually weaker.
  • Use proxies, such as free proxies you fetch yourself, paid proxies, Tor, or SOCKS proxies (see the requests sketch after this list).
  • On top of proxies, maintain your own proxy pool to avoid wasting proxies and to ensure they are available in real time.
  • Set up ADSL dial-up proxies, which are stable and efficient.
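Here is a minimal sketch of sending a request through a proxy with requests; the proxy address and target URL are placeholders, and in practice the proxy would come from your proxy pool.

```python
import requests

# Route a request through a proxy taken from a proxy pool.
proxy = '127.0.0.1:7777'                     # placeholder proxy address
proxies = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}
resp = requests.get('https://example.com', proxies=proxies, timeout=10)
print(resp.status_code)
```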

Anti-crawling/CAPTCHAs

There are many kinds of CAPTCHAs, such as ordinary image CAPTCHAs, arithmetic CAPTCHAs, slider CAPTCHAs, tap CAPTCHAs, SMS verification codes, QR code scanning, and so on.

  • For ordinary image CAPTCHAs that are regular, undistorted, and free of interference, OCR recognition works (see the pytesseract sketch after this list); you can also train a model with machine learning or deep learning; of course, a CAPTCHA-solving platform is the most convenient option.
  • For arithmetic CAPTCHAs, it is recommended to use a CAPTCHA-solving platform directly.
  • For slider CAPTCHAs, you can either analyze the underlying algorithm or simulate the sliding. For the latter, the key is to find the gap position, which you can do by image comparison, by writing a basic image recognition algorithm, by using a CAPTCHA-solving platform, or by training a recognition model with deep learning.
  • For tap CAPTCHAs, a CAPTCHA-solving platform is recommended.
  • For SMS verification codes, you can use a verification code distribution platform, buy dedicated code-receiving devices, or verify manually.
  • For QR code scanning, you can scan the code manually or use a CAPTCHA-solving platform.
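Below is a minimal OCR sketch for a clean, undistorted image CAPTCHA using pytesseract (it requires the Tesseract binary to be installed); the file name and binarization threshold are example choices.

```python
from PIL import Image
import pytesseract

# Simple preprocessing plus OCR for a regular image CAPTCHA.
image = Image.open('captcha.png').convert('L')           # grayscale
image = image.point(lambda x: 255 if x > 140 else 0)     # crude binarization, threshold is arbitrary
text = pytesseract.image_to_string(image).strip()
print(text)
```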

Anti-crawling/Account blocking

Some websites require login before they can be accessed, but if an account sends requests too frequently after logging in, it will be blocked. To avoid being blocked, you can take the following measures:

  • Look for the mobile or App version of the site; these usually expose interfaces with weaker verification.
  • Find interfaces that do not require login; if possible, find an entry point that can be crawled without logging in.
  • Maintain a Cookies pool: log in with a batch of simulated accounts, store their Cookies, and pick a random set of available Cookies for each crawl (a sketch follows this list); a reference implementation is https://github.com/Python3WebSpider/CookiesPool.
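Here is a minimal sketch of consuming a Cookies pool: the Redis hash name and the account-to-JSON layout are assumptions for illustration, not the exact storage format of the CookiesPool project linked above.

```python
import json
import random

import redis
import requests

# Pick a random logged-in account's Cookies from Redis and use them for a request.
r = redis.StrictRedis(host='localhost', port=6379, db=0)
account, raw = random.choice(list(r.hgetall('cookies:example').items()))  # assumed hash name
cookies = json.loads(raw)                                                 # assumed JSON dict of cookies

resp = requests.get('https://example.com/needs-login', cookies=cookies, timeout=10)
print(account, resp.status_code)
```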

Acceleration

When the amount of data to be crawled is very large, how to grab data efficiently and quickly is the key.

Common measures include multithreading, multiprocessing, asynchronous I/O, distribution, and various detail-level optimizations.

Acceleration/Multithreading and multiprocessing

Crawling is a network-request-intensive task, so using multiprocessing and multithreading, for example via the threading and multiprocessing modules, can greatly improve crawling efficiency.
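As a minimal sketch, the snippet below fetches a batch of pages concurrently with the standard library's thread pool; the URLs and worker count are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Fetch pages concurrently: threads work well for network-bound crawling.
urls = ['https://example.com/page/%d' % i for i in range(1, 11)]  # placeholder URLs

def fetch(url):
    return url, requests.get(url, timeout=10).status_code

with ThreadPoolExecutor(max_workers=5) as executor:
    for url, status in executor.map(fetch, urls):
        print(url, status)
```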

Acceleration/Asynchrony

Change the crawling process to a non-blocking form: handle each response as it arrives and let other tasks run during the waiting time. Options include asyncio, aiohttp, Tornado, Twisted, gevent, grequests, Pyppeteer, pyspider, Scrapy, etc.
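Below is a minimal asynchronous sketch with asyncio and aiohttp; the URLs are placeholders.

```python
import asyncio

import aiohttp

urls = ['https://example.com/page/%d' % i for i in range(1, 11)]  # placeholder URLs

async def fetch(session, url):
    # The await points let other requests run while this one waits on the network.
    async with session.get(url) as resp:
        return url, resp.status

async def main():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, url) for url in urls))
        for url, status in results:
            print(url, status)

asyncio.run(main())
```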

Acceleration/Distribution

The key to distribution is sharing the crawl queue. You can use queueing and messaging components such as Celery, Huey, RQ, RabbitMQ, and Kafka, or ready-made frameworks such as pyspider, Scrapy-Redis, and Scrapy-Cluster.
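As a minimal sketch of the shared-queue idea (using plain redis-py rather than any of the frameworks above), producers push URLs into a Redis list and workers on any machine pop them; the queue key name is an example.

```python
import redis

# A shared crawl queue in Redis: the simplest form of distribution.
r = redis.StrictRedis(host='localhost', port=6379, db=0)

# Producer side: enqueue a URL to crawl.
r.lpush('crawler:queue', 'https://example.com/page/1')   # example key and URL

# Worker side (can run on any number of machines): block until a URL is available.
_, url = r.brpop('crawler:queue')
print('crawling', url.decode())
```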

Acceleration/Optimization

Certain detail optimizations can also speed up crawling, for example:

  • DNS caching
  • Use faster parsing methods
  • Use more efficient deduplication methods (see the sketch after this list)
  • Separate modules and control them independently
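Here is a minimal deduplication sketch: URL fingerprints are kept in a Redis set so every worker shares the same deduplication state; the key name is an example, and a Bloom filter would be the usual next step when memory matters.

```python
import hashlib

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

def is_new(url):
    # sadd returns 1 if the fingerprint was newly added, i.e. the URL is unseen.
    fingerprint = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return r.sadd('crawler:seen', fingerprint) == 1       # example key name

print(is_new('https://example.com/page/1'))  # True on first call
print(is_new('https://example.com/page/1'))  # False afterwards
```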

Acceleration/Architecture

Once a distributed system is set up, we can use two kinds of architecture to maintain our crawler projects so that crawling, scheduling, management, and monitoring are all efficient:

  • Package Scrapy projects as Docker images and use Kubernetes to control their scheduling.
  • Deploy Scrapy projects to Scrapyd and manage them with dedicated management tools such as SpiderKeeper and Gerapy (a Scrapyd scheduling sketch follows this list).
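As a minimal sketch of the second option, the snippet below schedules a spider on a Scrapyd server through its HTTP API; the server address, project name, and spider name are placeholders.

```python
import requests

# Ask a running Scrapyd server to start a crawl job.
resp = requests.post(
    'http://localhost:6800/schedule.json',          # default Scrapyd port, placeholder host
    data={'project': 'myproject', 'spider': 'myspider'},  # placeholder names
    timeout=10,
)
print(resp.json())  # e.g. {'status': 'ok', 'jobid': '...'}
```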

In addition, I have made a richer mind map covering this material; a preview is shown below:

