Preface

This was the first time I had watched a complete technical live broadcast and, from a quiet distance, gotten in touch with Teacher Cui Qingcai. It was his NetEase course that got me through the door of crawling: I successfully developed crawlers, fetched the data I wanted, and experienced the fun of data capture. I would like to take this opportunity to briefly summarize everything I have learned and heard about crawlers from this live broadcast. If anything is inaccurate, please correct me so we can learn from each other. (Published after Cui had already written his own post, so the pressure is on; many parts reference his work, since this mainly follows his line of thinking. His post: juejin.cn/post/684490…)

The article is divided into six parts, following Teacher Cui's outline, with my own additions interspersed:

  • Crawling
  • Data parsing
  • Storage
  • The Scrapy framework
  • Anti-crawling
  • Speeding up

Crawling

This part is mainly about how to get the data source you want

Crawling types

There are two main types:

  • Server-side rendering (the page is rendered by the server and returned; the useful information is already in the HTML that comes back from the request)

  • Client-side rendering (the page is rendered by JavaScript; the HTML is only a static shell. The server returns the file as-is without further processing, and the client's JavaScript then builds the DOM and fills in the data)

For different types, the solutions are naturally different.

Solutions for server-side rendering

For the first type, server-side rendering, you can request the HTML page directly with any of various HTTP libraries (the data is already in the HTML). The following libraries are worth a look; a minimal request sketch follows the list:

  • urllib (Python's native low-level library)
  • urllib3 (many new features and functions compared to urllib)
  • pycurl (a Python interface to libcurl)
  • hyper (supports HTTP/2)
  • requests (the most widely used HTTP request library; recommended)
  • Grab (a wrapper around urllib3 and PyQuery)
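
As a minimal sketch with requests (the URL is a placeholder, not a real target site), fetching a server-rendered page might look like this:

```python
import requests

# Hypothetical server-rendered page; the data we want is already in the HTML
url = "https://example.com/list"
headers = {"User-Agent": "Mozilla/5.0"}  # pretend to be a normal browser

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()                      # fail fast on 4xx/5xx
response.encoding = response.apparent_encoding   # guard against mis-detected encodings
html = response.text                             # the HTML now contains the target data
print(html[:200])
```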

Solutions for client-side rendering

In this case, a plain request generally cannot retrieve the data you see in the browser; use one of the following four methods.

  • Finding Ajax requests

AJAX is a technique for creating fast, dynamic web pages. AJAX allows web pages to be updated asynchronously by exchanging small amounts of data with the server in the background. This means that part of a web page can be updated without reloading the entire page. There are many examples of applications using AJAX: Sina Weibo, Google Maps, Kaixin001, etc.

Solutions:

Use the Chrome/Firefox developer tools to inspect the Ajax request's method, parameters, and other details, then simulate it with an HTTP request library.

Alternatively, set up a proxy to capture packets and inspect the interface, for example with Fiddler or Charles.
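
For instance, once DevTools reveals the Ajax endpoint, the same request can be replayed with requests and the JSON parsed directly. A sketch (the endpoint, parameters, and response keys below are hypothetical):

```python
import requests

# Endpoint and parameters copied from the XHR tab of Chrome DevTools (hypothetical values)
api = "https://example.com/api/comments"
params = {"page": 1, "page_size": 20}
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",  # many sites mark Ajax calls this way
}

resp = requests.get(api, params=params, headers=headers, timeout=10)
data = resp.json()                          # the Ajax response is usually JSON
for item in data.get("comments", []):       # key name depends on the actual interface
    print(item)
```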

  • Emulating browser execution

This is suitable for pages with complex interfaces and logic.

Use Selenium, Splinter, Spynner, Pyppeteer, PhantomJS, Splash, requests-html, etc.
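
A minimal Selenium sketch, assuming Chrome and a matching chromedriver are available (the URL and CSS selector are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/")          # placeholder URL
    # Wait until JavaScript has rendered the element we care about
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.content"))
    )
    html = driver.page_source                   # fully rendered DOM
    print(html[:200])
finally:
    driver.quit()
```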

  • Extract JavaScript data directly

In this case, the data is hidden in JavaScript variables and can be extracted and analyzed directly with regular expressions.
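
For example, if the page embeds something like `var pageData = {...};` inside a script tag (the variable name here is made up), the JSON can be pulled out with a regular expression:

```python
import json
import re

# HTML with the data hidden in a JavaScript variable (simplified, hypothetical example)
html = '<script>var pageData = {"title": "demo", "views": 42};</script>'

match = re.search(r"var\s+pageData\s*=\s*(\{.*?\});", html, re.S)
if match:
    data = json.loads(match.group(1))   # the captured group is plain JSON
    print(data["title"], data["views"])
```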

  • Simulating JavaScript execution

Directly emulating a browser is inefficient; data retrieval can instead be done by executing the relevant JavaScript, with the help of libraries such as Selenium, PyExecJS, PyV8, and js2py.
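
A tiny sketch with js2py, assuming the site's signing routine has already been extracted into a JS snippet (the `sign` function below is a made-up stand-in, not any real site's logic):

```python
import js2py

# Hypothetical stand-in for a signing/encryption routine lifted from the site's JS
js_code = """
function sign(ts, key) {
    return key + "_" + (ts * 2);
}
"""

context = js2py.EvalJs()    # a fresh JavaScript execution context
context.execute(js_code)    # load the extracted function
# Call it from Python to build request parameters
print(context.sign(1700000000, "token"))
```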

App data crawling

  • Ordinary interfaces: direct requests

Capture HTTP/HTTPS packets with software such as Charles, Fiddler, or mitmproxy.

  • Interfaces with encrypted parameters

The interface parameters are encrypted, and each request carries a randomly generated token. They can be handled in real time with Fiddler, mitmdump, Xposed, and the like; alternatively, crack the encryption logic and construct the parameters yourself, or decompile the app directly.

  • Interfaces with encrypted content

The result returned by the interface is encrypted

1. Use Appium ("what you can see, you can crawl", similar to Selenium).
2. Use Xposed to hook the app and obtain the results.
3. Decompile the app, work out the encryption algorithm, and simulate it.
4. Rewrite the phone's underlying operating system source code.

  • Unconventional protocols

Capture packets of all protocols with Wireshark, or capture TCP packets with tcpdump.

Data parsing

Now that we have the data source, much of it is useless, so further data cleaning and processing is needed.

  • For HTML pages, there are four parsing methods (provided by the corresponding libraries): XPath, CSS selectors, regular expressions, and Selector. Each has its own rules; work through a few examples of each to remember how to use them. XPath is recommended (see the sketch after this list).

  • For other types, such as JSON and XML returned by an interface, you can process them directly with the json and xml2dict libraries.
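
A small sketch of both cases, using lxml for XPath on HTML and the standard json module for an interface response (the markup and keys are made up for illustration):

```python
import json

from lxml import etree

# --- HTML parsing with XPath ---
html = """
<ul>
  <li class="item"><a href="/a">First</a></li>
  <li class="item"><a href="/b">Second</a></li>
</ul>
"""
tree = etree.HTML(html)
titles = tree.xpath('//li[@class="item"]/a/text()')   # ['First', 'Second']
links = tree.xpath('//li[@class="item"]/a/@href')     # ['/a', '/b']
print(list(zip(titles, links)))

# --- JSON returned by an interface ---
raw = '{"code": 0, "data": [{"name": "demo"}]}'
payload = json.loads(raw)
print(payload["data"][0]["name"])
```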

With the parsing above, every new site you crawl requires writing the extraction rules again, which is very tedious; if the amount of data is small, copying and pasting by hand may even be more convenient. Is there any way around this?

Machine learning has found its way into the crawler field as well. Hence, intelligent parsing!

The following is divided into four methods:

  • The Readability algorithm, which defines sets of annotations for different blocks and computes weights to find the most likely block location.
  • Sparse-density judgment: compute the average length of text per unit number of nodes in a block and roughly distinguish blocks by density.
  • Scrapely self-learning: a component from the Scrapy developers; given a sample page and the desired extraction results, it learns extraction rules and applies them to other similar pages.
  • Deep learning: supervised learning of the positions to parse; requires a lot of annotated data.

Of course, machine-learning-based parsing cannot reach 100% accuracy yet; there will be errors in the data, and the water here is relatively deep.

Storage

Once the data is parsed, it has to be stored so it can serve as a data source for data mining and machine learning. Here are the common storage options (a MongoDB sketch follows the list):

  • Files, such as JSON, CSV, TXT, images, videos, and audio
  • Databases, divided into relational and non-relational databases, such as MySQL, MongoDB, HBase, etc.
  • Search engines, such as Solr and Elasticsearch, which make retrieval and text matching easy
  • Cloud storage; some media files can be stored on Qiniu Cloud, Upyun, Alibaba Cloud, Tencent Cloud, Amazon S3, and so on
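
As one concrete example, a sketch of writing parsed items into MongoDB with pymongo, assuming a local MongoDB instance (the connection string, database, and field names are placeholders):

```python
from pymongo import MongoClient

# Assumes a MongoDB instance is running locally; names are placeholders
client = MongoClient("mongodb://localhost:27017/")
collection = client["crawler"]["articles"]

item = {"title": "demo", "url": "https://example.com/a", "views": 42}

# Upsert on URL so re-crawling the same page does not create duplicates
collection.update_one({"url": item["url"]}, {"$set": item}, upsert=True)
print(collection.count_documents({}))
```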

The framework

From requesting the data, to parsing and cleaning it, to finally storing it, the basics of crawling end with these three parts. With all the different Python libraries involved it can feel like a mess, so below I will briefly talk about crawler frameworks. Open-source crawler frameworks solve and encapsulate these complex problems, providing all kinds of classes and functions that make our work simple and convenient; we no longer need to worry about URL deduplication, proxies, thread-pool management, and so on, and each project only needs to focus on the crawling logic. However, getting started with a framework has a certain threshold; I suggest reading the source code and thinking about its design ideas and implementation logic.

I've recently been learning the Scrapy framework, so I'll summarize its main logic here without going into detail (a minimal spider sketch follows the list):

  • On the first run, start_urls specifies the pages to crawl first; the built-in start_requests method creates the initial Requests and hands them to the ENGINE
  • The ENGINE passes the Requests to the SCHEDULER
  • The SCHEDULER queues the Requests and dispatches them
  • The ENGINE hands the Requests dequeued by the SCHEDULER to the DOWNLOADER, which fetches the web pages
  • The DOWNLOADER returns the Response to the ENGINE
  • The ENGINE gives the Response to the SPIDER to parse
  • The SPIDER parses the Response, returning Items with the data to be cleaned and new Requests for links the page may contain, both back to the ENGINE
  • Items go on to storage; new Requests go back to the SCHEDULER
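
To make the flow concrete, here is a minimal spider sketch (the site, selectors, and field names are hypothetical): the yielded dicts become Items that flow on to the item pipelines, and the yielded Requests go back through the ENGINE to the SCHEDULER.

```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com/list"]   # placeholder start page

    def parse(self, response):
        # Items: parsed data handed on to the item pipelines for storage
        for row in response.css("div.item"):
            yield {
                "title": row.css("a::text").get(),
                "link": row.css("a::attr(href)").get(),
            }
        # New Requests: handed back to the ENGINE and queued by the SCHEDULER
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```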

That's the main logic of the Scrapy framework. There is also MIDDLEWARE, divided into downloader middleware (for additional processing of Request and Response objects) and spider middleware (a hook framework for Scrapy's spider processing), which lets you add custom functionality to handle the Responses sent by the ENGINE to the spiders and the Requests and Items sent by the spiders back to the ENGINE.

Scrapy is broad and deep; keep digging…

Anti-crawling

As companies become increasingly aware of protecting their data, many websites, even small ones, have added anti-crawler detection, making it increasingly difficult for crawlers to get at the data.

Anti-crawler methods include non-browser detection, IP blocking, account blocking, font-based anti-crawling, and so on.

IP addresses and accounts

After crawling for a while, requests will start to be rejected; this can be solved by switching IPs and using proxies. It can be broken into several cases (a proxy-rotation sketch follows the list):

  • First look for the mobile site or the App's site; if one exists, its anti-crawling is usually weaker, so start with the easy target.
  • Use proxies, such as scraped free proxies, paid proxies, Tor proxies, Socks proxies, etc.
  • On top of the proxies, maintain your own proxy pool to avoid wasting proxies and to ensure they are available in real time.
  • Set up ADSL dial-up proxies, which are stable and efficient.
  • Find interfaces that do not require login; if possible, crawl what can be reached without logging in.
  • Maintain a Cookies pool: simulate login with a batch of accounts and randomly pick available Cookies when crawling.
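
A minimal sketch of rotating proxies with requests (the proxy addresses are placeholders; a real proxy pool would also check availability and retire dead proxies):

```python
import random

import requests

# Placeholder proxy pool; in practice this comes from a maintained proxy-pool service
PROXIES = [
    "http://111.111.111.111:8888",
    "http://122.122.122.122:8000",
]


def fetch(url):
    proxy = random.choice(PROXIES)   # rotate proxies between requests
    try:
        return requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=5,
        )
    except requests.RequestException:
        return None   # a real pool would mark this proxy as dead and retry


resp = fetch("https://example.com/")
print(resp.status_code if resp else "all proxies failed")
```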

Verification code

  • For ordinary graphic verification codes, if they are regular with no distortion or interference, you can use OCR, or train an image-recognition model with machine learning; of course, a captcha-solving platform is the most convenient option.
  • For arithmetic verification codes, it is recommended to use a captcha-solving platform directly.
  • For sliding verification codes, you can use a cracking algorithm or simulate the slide (mimicking how a human drags: accelerate first, then decelerate). The key for the latter is finding the gap, which can be done with image comparison, a basic graphic-recognition algorithm, a captcha-solving platform, or a recognition interface trained with deep learning.
  • For tap/click verification codes, a captcha-solving platform is recommended. For mobile verification codes, you can use a verification-code distribution platform, buy a dedicated code-collection device, or verify manually.
  • For QR-code scanning, you can scan manually or use a captcha-solving platform.

Speeding up

When the amount of data is large, crawling efficiently and quickly becomes the key problem (an asynchronous sketch follows the list):

  • Multi-process and multi-threading
  • Asynchronous I/O
  • Distributed crawling
  • Optimization
  • Architecture
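
As a sketch of the asynchronous approach, concurrent fetching with asyncio and aiohttp (the URLs are placeholders):

```python
import asyncio

import aiohttp

URLS = [f"https://example.com/page/{i}" for i in range(1, 6)]   # placeholder URLs


async def fetch(session, url):
    # Requests are issued concurrently; each coroutine yields while waiting on I/O
    async with session.get(url) as resp:
        body = await resp.text()
        return url, resp.status, len(body)


async def main():
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        results = await asyncio.gather(*(fetch(session, u) for u in URLS))
        for url, status, size in results:
            print(url, status, size)


asyncio.run(main())
```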

Finally, here is the mind map Cui summarized. A tribute to Cui!

Finally, I want to share my own feelings. Crawlers let me fetch data and generate tables automatically, solving meaningless repetitive chores, and by chance I found a sense of achievement in that. After a lot of further study I thought about making crawling my main direction for the future, but as I kept learning I realized my understanding of crawlers is still not particularly clear. For many reasons, and after taking Cui's advice, I finally decided to treat crawling as an interest and a side skill. I will keep studying as always, because I know the sense of achievement from grabbing a large amount of data is real and does not depend on any external factor. If you have the same doubts, I hope my sharing can give you some reference. When nothing is pressing, let's leave everything behind, return to the original mindset of learning crawlers, crawl more data, and feel the charm of it.