
The following article comes from Tencent Cloud; author: Peacock.


Different websites call for different technical strategies and different framework combinations.

Selenium framework: I have named this framework the “unblockable crawler Spider-Man”. Its advantage is that it simulates a real browser; it is the equivalent of programmatically operating a browser to open the website you need to crawl, and the benefit is that you can avoid being blocked. When you send requests with Python’s Requests library, you must first construct the HTTP request headers yourself, and some sites have very strict anti-crawling measures that can directly identify that your access is not normal user behavior. So if the target site cannot be crawled with plain requests, Selenium is the best technical choice. As long as a site can be opened in your browser, Selenium can collect its data.
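To make the contrast with Requests concrete, here is a minimal sketch of driving a browser with Selenium. The URL is a placeholder, not any site from the article:

```python
# A minimal sketch of the Selenium approach described above.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
driver = webdriver.Chrome(options=options)
try:
    # Selenium drives a real Chrome, so there is no need to hand-build
    # HTTP request headers the way you would with the Requests library.
    driver.get("https://example.com/page-to-crawl")  # hypothetical URL
    html = driver.page_source  # fully rendered HTML, JavaScript included
    print(html[:200])
finally:
    driver.quit()
```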

Selenium framework advantages: its anti-blocking capability is strong, which makes it well suited to sites that are hard to crawl or that require clicking buttons and submitting forms. When I was crawling trademark data, my boss asked me to crawl the entire site: tens of millions of trademarks. The site was hard to crawl, and you also had to click a confirm button before you could enter the comprehensive trademark search page, then search by registration number to reach the list page, then click through from the list page to the trademark detail page, and finally click through from the detail page to the trademark process page. To crawl millions of records a day, I initially adopted a Python Requests + proxy IP pool architecture and ran it with multiple processes. Unfortunately, the target site quickly recognized that the requests were not from a normal user, because they came too fast. So we had to switch to a Selenium + multiprocessing combination; a sketch of the click-through flow follows below.
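The sketch below follows the click-through flow described above (confirm button, search by registration number, list page, detail page, process page). All URLs, element locators, and the registration number are hypothetical, since the real trademark site’s markup is not given in the article:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 15)
try:
    driver.get("https://example-trademark-site.com")  # hypothetical URL
    # Step 1: click the confirm button to reach the comprehensive search page.
    wait.until(EC.element_to_be_clickable((By.ID, "confirmBtn"))).click()
    # Step 2: search by registration number.
    box = wait.until(EC.presence_of_element_located((By.ID, "regNoInput")))
    box.send_keys("12345678")  # hypothetical registration number
    driver.find_element(By.ID, "searchBtn").click()
    # Step 3: open the first result on the list page -> detail page.
    wait.until(EC.element_to_be_clickable(
        (By.CSS_SELECTOR, ".result-list a"))).click()
    # Step 4: from the detail page, open the trademark process page.
    wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "Process"))).click()
    print(driver.page_source[:200])
finally:
    driver.quit()
```

For the scale described in the article, one such browser session would run per worker process, with a pool of processes driving separate registration-number ranges.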

Disadvantages of the Selenium framework: it is slow, so it suits crawlers that do not need to collect large volumes of data daily. Selenium has to open a real browser and then simulate clicking through web pages, which is about as fast as you opening a browser and visiting the site by hand; that speed is relatively slow. If you only need to collect 10,000 to 20,000 records per day, the Selenium framework works well, because it is more stable and reliable.
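At that lower daily volume, one way to keep the slow framework stable is to reuse a single browser instance across all records instead of launching a new one per page. A small sketch, with a hypothetical URL pattern and registration numbers:

```python
from selenium import webdriver

driver = webdriver.Chrome()
results = {}
try:
    for reg_no in ["12345678", "23456789"]:  # hypothetical IDs
        # Reusing one browser avoids paying the startup cost per record.
        driver.get(f"https://example-trademark-site.com/detail/{reg_no}")
        results[reg_no] = driver.page_source  # parse as needed
finally:
    driver.quit()
```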