Welcome to the Tencent Cloud technology community, where you can find more hands-on engineering write-ups from Tencent.

Author: David Lee

PhantomJS overview

What is PhantomJS?

PhantomJS bills itself as a "full web stack, no browser required": what is commonly called a headless browser, or better put, a web engine without a user interface.

Features of PhantomJS

Because it is headless, it is faster than a typical browser: it never has to render a visual interface. And because it implements a complete web protocol stack, it offers not only a JavaScript API but also support for a full range of web standards (DOM manipulation, CSS selectors, JSON, Canvas, and SVG) plus file-system and operating-system APIs. In addition, it exposes a number of web event handlers, which is the key point that differentiates PhantomJS from web-automation tools such as Selenium; this will be explained in detail in the later discussion of security inspection.

The features of PhantomJS are summarized in the following table:

A summary of the APIs provided by PhantomJS

The WebPage API

  • HTML documents
  • DOM
  • Handle cookies
  • Handle events
  • Send requests
  • Receive responses
  • Wait For AJAX
  • User Interaction
  • Render Full Images

The System API

  • Get OS information
  • Access command-line arguments

The FileSystem API

  • Read data from a file
  • Write JSON data to a file

The WebServer API

  • Process client requests

Summary

Given all of the above, PhantomJS is especially well suited to one job: writing crawlers!

Its JavaScript support lets you load resources dynamically and mimic human actions; DOM manipulation lets you walk a page's structure; CSS support makes it quick and easy to render page documents, and to save them as images or PDFs anywhere; support for JSON, Canvas, and SVG is a bonus when working with data or multimedia pages; and the file-system API makes it easy to format and store the results.

Common ways to use PhantomJS

1: Interactive mode (REPL)

After downloading PhantomJS, running phantomjs with no arguments drops you straight into interactive mode, where you can use PhantomJS as a JavaScript interpreter: evaluate expressions, call JS methods, inspect "browser" information through the window.navigator object, and so on. If you have PhantomJS installed, feel free to type in a few commands. When you are done, type phantom.exit() to quit.

Figure: Phantomjs in REPL mode
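A short session might look like this (the output is illustrative; the exact version and user-agent strings depend on your PhantomJS build):

```
$ phantomjs
phantomjs> 1 + 2
3
phantomjs> window.navigator.userAgent
"Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1"
phantomjs> phantom.exit()
```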

If you are new to JS, this mode offers little more than Chrome's console for practicing JS commands. Moreover, it is not used very often in practice; PhantomJS is used far more as a scripted binary tool.

2: As a binary tool

This is also the most common way to use PhantomJS: phantomjs /scripts/somejavascript.js runs a JavaScript script. Inside the script you can call the various APIs PhantomJS provides (the Markdown dialect used by KM does not support in-page anchors; see the "A summary of the APIs provided by PhantomJS" section earlier in this article).

Open the page

The script below creates a webpage instance, then uses its open method to load the qq.com home page; if "success" is returned, it logs the page title, and finally it exits.

/****************************************************************
* create an instance of the webpage module
* get the page and echo the page's title
* file: somejavascript.js
* author : Taerg
* date : 12/05/2017
*****************************************************************/
var page = require('webpage').create();

// open the page; the callback checks the status and echoes the title
page.open("http://www.qq.com", function(status) {
    if (status === "success") {
        console.log("Page load success. The page title is:");
        console.log(page.title);
    } else {
        console.log("Page load failed.");
    }
    phantom.exit(0);
});

Getting cookies

Of course, we can also use page.content to get all the content of a page, and use page.cookies to get cookies.

Below, we grab the cookies set on a visit to the qq.com home page and print them to the log as key-value pairs:

/****************************************************************
* create an instance of the webpage module
* echo the cookies
* author : Taerg
* date : 12/05/2017
*****************************************************************/
var page = require('webpage').create();

page.open("http://www.qq.com/", function(status) {
    if (status === 'success') {
        var cookies = page.cookies;
        for (var i in cookies) {
            console.log(cookies[i].name + '=' + cookies[i].value);
        }
    }
    phantom.exit(0);
});

The corresponding output is:

Figure: phantomjs_getcookie

Executing JavaScript

PhantomJS is a headless "browser" and of course has excellent JavaScript support. Below, we define a simple function that retrieves the page title and returns it; a single call to page.evaluate() executes this JavaScript code inside the page.

/****************************************************************
* create an instance of the webpage module
* include the system module
* author : Taerg
* date : 12/05/2017
*****************************************************************/
var system = require('system');
var url = system.args[1];
console.log(url);

var page = require('webpage').create();
page.open(url, function(status) {
    if (status === "success") {
        var title = page.evaluate(function () {
            return document.title;
        });
        console.log(title);
    }
    phantom.exit(0);  // exit inside the callback, after the page has loaded
});

Using a third-party JS library (such as jQuery)

We can also use third-party JavaScript libraries in PhantomJS if reinventing the wheel in our own JavaScript is too much hassle. PhantomJS gives us two ways to load a third-party library:

  • Method 1: includeJs()
  • Method 2: injectJs()

The two are often interchangeable; the main difference is that injectJs loads synchronously (blocking) while includeJs loads asynchronously (dynamically). With injectJs, the program blocks while the JS file is loaded into memory, then continues, and no further request for the file is made while operating the page. includeJs, in contrast, loads the JS file dynamically as the page loads it.

Example code is as follows:

/****************************************************************
* create an instance of the webpage module
* load a third-party js lib
* author : Taerg
* date : 12/05/2017
*****************************************************************/
var page = require('webpage').create();

// open the page
page.open("http://www.qq.com", function(status) {
    page.injectJs('jquery321.js');  // compare with page.includeJs
    if (status === 'success') {
        var jqVersion = page.evaluate(function() {
            return (typeof jQuery === 'function') ? jQuery.fn.jquery : undefined;
        });
        console.log(jqVersion);
    } else {
        console.log('open error');
    }
    phantom.exit(0);
});

The output is as follows:

We first inject a local copy of jQuery 3.2.1, then use jQuery itself to report the loaded version. Of course, this merely verifies that jQuery loaded successfully; from here we can use the rest of jQuery's shortcuts to meet our actual needs.
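For comparison, here is a minimal sketch of the includeJs() variant. It must be run under the PhantomJS runtime, and the jQuery CDN URL is an assumption chosen for illustration:

```javascript
var page = require('webpage').create();

page.open("http://www.qq.com", function(status) {
    if (status !== 'success') {
        console.log('open error');
        phantom.exit(1);
        return;
    }
    // includeJs loads the script asynchronously inside the page, then
    // fires this callback once the script tag has finished loading
    page.includeJs('https://code.jquery.com/jquery-3.2.1.min.js', function() {
        var version = page.evaluate(function() {
            return (typeof jQuery === 'function') ? jQuery.fn.jquery : undefined;
        });
        console.log('jQuery version: ' + version);
        phantom.exit(0);
    });
});
```

Because includeJs is asynchronous, any code that depends on the library must live in (or after) its callback, whereas after injectJs returns the file is already in memory.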

Saving a screenshot of a specified page region

When working with a page, we often need to save screenshots of it, for example to record the appearance of a page bug or to retain key information. For this we can use the render method of the PhantomJS page object, which can save either the entire page (an auto-scrolling screenshot) or a specified region of it, in formats such as .png, .pdf, and .jpg.

Here, we want the forecast details from the weather site without caring about the assorted news and ads on the page, so we just specify the region and save a screenshot:

/****************************************************************
* phjs_clip.js
* get the weather pic
* author : Taerg
* date : 12/05/2017
*****************************************************************/
var page = new WebPage();

page.open('http://www.weather.com.cn', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // clip to the region we care about: offsets from the top-left corner
        page.clipRect = {
            top: 200,
            left: 750,
            width: 300,
            height: 500
        };
        page.render('weather.png');
        console.log('Capture saved');
    }
    phantom.exit();
});

The saved images are as follows:

Figure: phantom_get_weather

Raging against an "anti-crawler" in three lines of code

Normal User Access

When we visit https://media.om.qq.com/media/5054676/list normally in a browser, everything works fine, as shown below:

Figure: safari_get_omqq

According to the author of this anti-crawler system, the client computes a ticket via JavaScript and carries it in a cookie, which the server verifies again: if verification passes, the data is returned; if it fails, nothing is returned. As shown below:

Figure: anti_spide

Now let's try pulling the page data automatically with scripts.

An ordinary static crawler

  • curl get

First, we request the target page directly with a curl GET:

Figure: curl_get_omqq

It was intercepted by the anti-crawler system, as shown above.

Let's try Python next, issuing a GET request with the ubiquitous "HTTP for Humans" library, requests:

Figure: request_get_omqq

As you can see, it’s still intercepted by the anti-crawler mechanism.

How the anti-crawler works

Through manual browser access and packet capture analysis, we can see:

1. The first request goes straight to the target URL; with no valid ticket, it gets a 403. The 403 page also carries two JavaScript files.

Figure: load_js

2. The next two requests load the JavaScript from the 403 page.

3. Once those scripts have run, a legitimate ticket is obtained and added to the cookie, and the request is issued again; that is the fourth request, shown below:

Figure: omqq_signiture

4. Because that request carries a valid ticket, it is not rejected with 403 Forbidden; instead a customer ID is added and a 302 redirects to the data page. The figure below shows the id signature added in Set-Cookie.

Figure: redirect

5. At this point the cookie contains both the valid signature and the customer ID, the JSON data is requested, and we get the normal page:

Figure: omqq_safari_get

A dynamic crawler based on PhantomJS

At this point, based on the previous analysis, we can use Phantomjs to gradually simulate human requests and bypass the anti-crawler system. First look at the code:

/****************************************************************
* phjs_antispider.js
* anti-anti-spider script for https://media.om.qq.com/media/5054676/list
* author : Taerg
* date : 12/05/2017
*****************************************************************/
var page = require("webpage").create();
var system = require("system");
var url = system.args[1];

page.customHeaders = {};
page.settings = {
    javascriptEnabled: true,
    userAgent: 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'
};
page.viewportSize = { width: 1024, height: 768 };

page.open(url, function(status) {
    page.injectJs('jquery321.js');
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        page.evaluate(function() {
            // walk every element and run any inline "javascript:" href it carries
            var allElements = $('*');
            for (var i = 0; i < allElements.length; i++) {
                if (allElements[i].href) {
                    var javascript_code = allElements[i].href.match("javascript:(.+)");
                    if (javascript_code) {
                        console.log(javascript_code[0]);
                        eval(javascript_code[0]);
                    }
                }
            }
        });
    }
    // give the ticket-computing scripts time to run, then dump the page
    window.setTimeout(function() {
        console.log("crawl_content:" + page.content + "content_end");
        phantom.exit();
    }, 1000);
});

In the code above:

  1. Modify page.settings to enable JavaScript;
  2. Customize the User-Agent to impersonate a real browser;
  3. Set a realistic viewport resolution to further mimic a human visitor;
  4. Inject the jQuery file once the page is open;
  5. Use jQuery selectors to select every element on the page;
  6. Execute any inline JavaScript found in those elements;
  7. Set a page timeout, then print the page content.
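The element-scanning logic in steps 5 and 6 boils down to plain string matching, and can be sketched outside PhantomJS as well. The function name and sample hrefs below are hypothetical, for illustration only:

```javascript
// Given an element's href, pull out any inline "javascript:" payload.
function extractJsPayload(href) {
  var m = href.match(/^javascript:(.+)/);
  return m ? m[1] : null; // m[1] is the code after the "javascript:" prefix
}

console.log(extractJsPayload("javascript:computeTicket();")); // prints: computeTicket();
console.log(extractJsPayload("https://example.com/list"));    // prints: null
```

Hrefs that carry such a payload are the ones the crawler eval()s to let the page compute its ticket.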

We can see that our request has bypassed the anti-crawler mechanism.

Figure: phantomjs_get_omqq

Crawling in three lines of code: a human-like dynamic crawler based on CasperJS

"God, I'm a developer; don't ask me to capture packets and analyze protocols!! Baby just wants to crawl some data..."

Let’s do this with three lines of code:

  1. The first line creates a Casper instance
  2. The second line initiates the request
  3. The third line executes and exits

/****************************************************************
* crawl the anti-spider website: om.qq.com
* author : Taerg
* date : 17/05/2017
*****************************************************************/

var casper = require("casper").create();
casper.start('https://media.om.qq.com/media/5054676/list', function() {
  require('utils').dump(JSON.parse(this.getPageContent()));
});
casper.run(function() {
    this.exit();
});

The results are as follows:

Figure: casper_get_omqq

These three lines of code not only bypass the anti-crawler restrictions; the JSON dump also displays (and can store) the data in a structured form, which greatly simplifies the development of complex crawlers.

These three lines of code are powered by CasperJS.

CasperJS officially describes itself as an open-source navigation scripting and testing utility, and it is genuinely fun to use. Its features include:

  • defining & ordering navigation steps
  • filling forms
  • clicking links
  • capturing screenshots of a page (or an area)
  • making assertions on remote DOM
  • logging & events
  • downloading resources, even binary ones
  • catching errors and reacting accordingly
  • writing functional test suites, exporting results as JUnit XML (xUnit)

Beyond that, CasperJS's greatest strength needs no further sales pitch after this brief introduction: it ships with rich documentation and plenty of sample code. That is far friendlier than PhantomJS, whose core documentation long consisted of TODOs, leaving us to work everything out ourselves.

Related reading

Mastering the Python web crawler: a web-crawler learning path

This article has been published with the author's authorization by the Tencent Cloud technology community; please credit the source when reprinting.

The original link: https://www.qcloud.com/community/article/636391