Anti-crawler and anti-anti-crawler?

Starting with this chapter, we move into the anti-crawler part of the book.

Readers seeing these terms for the first time may well be puzzled, so let's start by introducing what a crawler, an anti-crawler, and an anti-anti-crawler are.

A crawler is essentially the code we learned earlier: calling requests.get("http://xxx.com") to fetch a site's source code.

Most of the time, however, the data we want is valuable, and website developers don't want us to get it, so they deploy plenty of anti-crawler strategies to stop us from scraping it so easily. These strategies mainly fall into three categories:

  • ① JS encryption: the HTML is generated by JavaScript code, so we can't fetch the finished HTML directly.
  • ② IP banning: a crawler usually makes all its requests from a single IP. If the request frequency is too high, the site can simply block that IP.
  • ③ Verification codes: a captcha must be solved before the data is served.

Beyond these three there are others, such as User-Agent detection, but those are more basic and we won't expand on them here.

Anti-anti-crawling, in turn, means finding a counter for each of the three strategies above:

  • ① Against JS encryption: use a headless browser to render the JS, then parse the rendered HTML.
  • ② Against IP banning: use proxy IPs or similar tricks to visit the site from different addresses (see the sketch after this list).
  • ③ Against verification codes: recognize the captcha programmatically, or hand it off to a human captcha-solving platform.
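
As a taste of ②, here is a minimal sketch of routing a request through a proxy with the requests library; the proxy address below is a hypothetical placeholder that you would replace with a real proxy:

import requests

# Hypothetical proxy address; substitute a working proxy of your own.
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

# The same requests.get as before, but routed through the proxy,
# so the target site sees the proxy's IP instead of ours.
resp = requests.get("http://xxx.com", proxies=proxies)
print(resp.status_code)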

This answer on Zhihu paints a very vivid picture of the back-and-forth between crawlers, anti-crawlers, and anti-anti-crawlers:

How to deal with website anti-crawler strategies? How to efficiently crawl large amounts of data? – ShenYuBao's answer – Zhihu: www.zhihu.com/question/28…

Contents of this chapter

From the introduction above, readers should now have a basic picture of anti-crawling. This chapter covers the counter to the first anti-crawler strategy: JS encryption.

Trying to crawl baidu.com

Before introducing the counter-technique, let's first experiment on Baidu. Suppose we want to crawl Baidu's search results for 美女 (beautiful women), i.e. this link: www.baidu.com/s?wd=%E7%BE… (note: the query string is URL-encoded).

Let's try it with the requests library that has served us so well up to now.
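
A minimal sketch of that attempt (using the full, URL-encoded form of the search link, which also appears later in this chapter):

import requests

# Naive attempt: fetch the Baidu search results page directly.
resp = requests.get("https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3")
print(resp.status_code)
print(resp.text)  # inspect what Baidu actually returned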

If you know a little HTML, you'll see that the returned source contains no search results at all; it just bounces us back to www.baidu.com/.

What is a headless browser?

With requests.get we got nothing, yet when we open www.baidu.com/s?wd=%E7%BE… in a browser, we do see the search results.

Requests alone clearly wasn't enough: Baidu's backend identified us as a lowly crawler, looked down on us, and gave us nothing.

Since a browser can get the page, can we drive a browser through an API and crawl the search results that way?

Yes, we can: there is a kind of browser called a headless browser. A browser with a graphical interface has to render the page visually, which can even involve the graphics card and is very resource-intensive; a headless browser skips the interface entirely, which makes it well suited to crawler developers.

PhantomJS

The headless browser we introduce here is PhantomJS. Although it is no longer maintained, that doesn't stop us from using it.

Download PhantomJS

First, download and install PhantomJS. Download link: phantomjs.org/download.ht… .

After downloading, unzip it and you'll find the PhantomJS executable under the bin directory (phantomjs.exe on Windows).

Install Selenium

Next, install Selenium directly with pip:

pip install selenium

Selenium is a tool for web application testing. It works with many browsers, including PhantomJS, Chrome, Firefox, and more.

PhantomJS is normally scripted in JavaScript; with Selenium installed, we can drive it from Python instead.

Crawling Baidu with PhantomJS

With everything installed, let's try crawling Baidu again, this time with PhantomJS. The code is as follows:

from selenium import webdriver

exe_path = "/usr/bin/phantomjs"
driver = webdriver.PhantomJS(executable_path=exe_path)

driver.get("https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3")

driver.save_screenshot("screenshot.png")  # take a screenshot of the rendered page
print(driver.page_source)  # print the rendered HTML source

driver.quit()

You need to set exe_path to wherever you put your PhantomJS executable. The code first creates a driver object, then calls get() to load the page; PhantomJS automatically executes the page's JS and renders the result.

Note: call driver.quit() when you're done, otherwise orphaned PhantomJS processes will pile up in the background.
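
A simple way to guarantee that cleanup is a try/finally block; a minimal sketch:

from selenium import webdriver

exe_path = "/usr/bin/phantomjs"
driver = webdriver.PhantomJS(executable_path=exe_path)
try:
    driver.get("https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3")
    print(driver.page_source)
finally:
    driver.quit()  # always runs, even if the crawl raises an exception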

Here we also called Selenium's screenshot method to see the result.

As you can see, we even got a screenshot of the site. Beyond screenshots, Selenium supports many operations that normally require JS, such as simulated clicks and scrolling; two of them are sketched below. You can dig deeper on your own if you're interested.
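
A minimal sketch of those two operations, reusing the driver from the example above (the element id "su" is assumed here to be Baidu's search button):

# Reusing the driver object created above:
driver.find_element_by_id("su").click()  # simulate clicking the search button
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to the page bottom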

Modify the UA of PhantomJS

PhantomJS's default User-Agent contains the string PhantomJS. If we crawl with it unchanged, sites can spot us easily, so when creating the driver we can pass extra configuration that changes PhantomJS's UA and disguises it as Chrome. The code is as follows:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)  # copy the default PhantomJS capabilities
dcap['phantomjs.page.settings.userAgent'] = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')
exe_path = "/usr/bin/phantomjs"
driver = webdriver.PhantomJS(executable_path=exe_path, desired_capabilities=dcap)
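To check that the disguise took effect, one option is to load a page that echoes the request's User-Agent back; httpbin.org is used here purely as an illustrative choice:

driver.get("https://httpbin.org/user-agent")
print(driver.page_source)  # should now contain the Chrome UA string
driver.quit()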

Chrome’s headless mode

PhantomJS visitor detection

Of course, you'll sometimes find sites that can't be crawled even with PhantomJS, as I have seen myself. This is because PhantomJS is built on the Qt framework, and the way Qt implements its HTTP stack sets it apart from other modern browsers.

In Chrome, the headers of an HTTP request look like this:

GET / HTTP/1.1
Host: localhost:1337
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8,ru;q=0.6

In PhantomJS, however, the same HTTP request looks like this:

GET / HTTP/1.1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.8 Safari/534.34
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Connection: Keep-Alive
Accept-Encoding: gzip
Accept-Language: en-US,*
Host: localhost:1337

You'll notice that the PhantomJS headers differ from Chrome's (and, as it turns out, from every other modern browser's) in a few subtle ways:

  • The Host header appears last
  • The Connection header value is mixed-case (Keep-Alive)
  • The only Accept-Encoding value is gzip

By checking for these header quirks on the server, a site can recognize the PhantomJS browser.
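
A minimal server-side sketch of such a check, using Python's built-in http.server; it tests exactly the three quirks listed above (real detectors are more thorough):

from http.server import BaseHTTPRequestHandler, HTTPServer

class DetectingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        names = list(self.headers.keys())  # header names, in arrival order
        suspicious = (
            (names and names[-1].lower() == "host")             # Host header sent last
            or self.headers.get("Connection") == "Keep-Alive"   # mixed-case value
            or self.headers.get("Accept-Encoding") == "gzip"    # gzip and nothing else
        )
        body = b"Looks like PhantomJS" if suspicious else b"Welcome"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 1337), DetectingHandler).serve_forever()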

Selenium+Chrome

If you run into this, you might consider another headless browser, such as Chrome's headless mode.

Chrome's headless mode can also be driven through Selenium. To do this, download chromedriver and use the Selenium API as before. Below is a simple example; if you're interested, you can search for more detailed tutorials yourself.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')  # run Chrome without a visible window
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path='/usr/bin/chromedriver')

driver.get("https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3")
print(driver.page_source)  # rendered HTML, just like with PhantomJS
driver.quit()

Another tool: Splash

There is also a tool called Splash, a JavaScript rendering service built on Twisted and Qt5 that exposes an HTTP API.

Splash performs much better than PhantomJS or Chrome's headless mode and supports concurrent rendering, but it needs to run in Docker.
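
A minimal sketch of calling Splash's render.html endpoint with requests, assuming a Splash instance is already running in Docker on the default port (docker run -p 8050:8050 scrapinghub/splash):

import requests

# Ask Splash to load the page, run its JavaScript, and return the final HTML.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3", "wait": 1},
)
print(resp.text)  # HTML after JavaScript has run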

Two popular Python crawler frameworks, Scrapy and PySpider, can both use Splash as their JS rendering engine.

This has been only a brief introduction; if you're interested, you can search for more detailed tutorials yourself.

References

  • Detecting PhantomJS-based visitors