Tencent Questionnaire's pages are entirely dynamic: all content is delivered through Ajax interfaces. It is well known that most search engine crawlers do not execute JavaScript, which means that if page content is returned by Ajax, search engines cannot crawl it, and SEO is impossible.
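To make the problem concrete, here is a minimal sketch (the endpoint and markup are hypothetical, not Tencent Questionnaire's actual code). The server returns an empty shell, and the content only appears after client-side JavaScript runs, which most crawlers never execute:

```javascript
// The initial HTML the server returns is just an empty container:
//   <div id="app"></div>
// A crawler that doesn't execute JS indexes exactly that, and nothing more.
var xhr = new XMLHttpRequest();
xhr.open('GET', '/api/survey/123'); // hypothetical Ajax endpoint
xhr.onload = function () {
    // Only a real browser gets here and sees the actual content
    document.getElementById('app').innerHTML = JSON.parse(xhr.responseText).html;
};
xhr.send();
```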

Let's take a look at the results.

Last year the number of indexed pages was pitifully low. Even more damning, the pages that did get indexed showed the original HTML title, so heavily weighted pages were listed under a useless title. After the pre-rendering service went live at the end of last year, the indexed volume climbed steadily, and the indexed titles were all correct. Apart from a configuration change at the Nginx access layer, none of this required touching business code: everything else is a bypass mechanism. In other words, build it once and it can be shared by every service of the same type, without affecting any existing business code or process.

PhantomJS to the rescue

The problem of Ajax content being invisible to SEO had bothered me for a long time, until PhantomJS made it possible to render pages, JavaScript and all, on the server.

PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

Prepare a PhantomJS task script

I’ll call it spider.js.

/*global phantom*/ "use strict"; Var resourceWait = 500; var resourceWait = 500; var resourceWaitTimer; Var maxWait = 5000; var maxWaitTimer; Var resourceCount = 0; // WebPage module var page = require(' WebPage ').create(); Var system = require('system'); Var URL = system.args[1]; var URL = system.args[1]; // Set PhantomJS viewsize page.viewportSize = {width: 1280, height: 1014}; Var capture = function(errCode){console.log(page.content); // Clear the timer clearTimeout(maxWaitTimer); // Complete, exit phantom. Exit (errCode) normally; }; // resource requests and counts page. OnResourceRequested = function(req){resourceCount++; clearTimeout(resourceWaitTimer); }; OnResourceReceived = function (res) {// Chunk mode HTTP packet back, will trigger resourceReceived event many times, End if (res.stage! == 'end'){ return; } resourceCount--; If (resourceCount === 0){if (resourceCount === 0){if (resourceCount === 0){if (resourceCount === 0){ ResourceWaitTimer = setTimeout(capture, resourceWait); }}; OnResourceTimeout = function(req){resouceCount--; }; OnResourceError = function(err){resourceCount--; }; Page. Open (url, function (status) {if (status! == 'success') { phantom.exit(1); Setwaittimer = setTimeout(function(){capture(2);} else {// Start the timer when the page's initial HTML returns successfully.  }, maxWait); }});Copy the code

Run the PhantomJS command directly and you can see the rendered HTML structure in the terminal:

```bash
phantomjs spider.js 'http://wj.qq.com/'
```
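Note the exit codes spider.js uses: 1 means the initial HTML failed to load, while 2 means the five-second deadline fired and the partially rendered page was captured anyway. The Node service in the next step branches on exactly these codes.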

Turning the command into a service

What does that mean? A bare command cannot respond to requests from search engine crawlers, so we need to turn it into a service. PhantomJS does come with a Web Server module, but it has never been stable and, as mentioned in the previous article, sometimes hangs. So let's put a simple web service in front of it with Node.

```javascript
var express = require('express');
var app = express();

// child_process lets us spawn a PhantomJS process per request
var child_process = require('child_process');

// Match every path so any URL the crawler requests gets pre-rendered
app.get('*', function (req, res) {
    // Rebuild the full URL the crawler originally asked for
    var url = req.protocol + '://' + req.hostname + req.originalUrl;

    var content = '';

    // Spawn PhantomJS with the task script and the target URL
    var phantom = child_process.spawn('phantomjs', ['spider.js', url]);

    // Set the character encoding of PhantomJS's stdout
    phantom.stdout.setEncoding('utf8');

    // Watch PhantomJS's stdout and concatenate the rendered HTML
    phantom.stdout.on('data', function (data) {
        content += data.toString();
    });

    // When PhantomJS exits, respond according to its exit code
    phantom.on('exit', function (code) {
        switch (code) {
            case 1:
                console.log('Load failed: ' + url);
                res.send('Load failed');
                break;
            case 2:
                console.log('Load timed out: ' + url);
                // Timed out, but send whatever was rendered in time
                res.send(content);
                break;
            default:
                res.send(content);
                break;
        }
    });
});

// The Nginx config below proxies crawler traffic to this port
app.listen(3000);
```
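Before wiring up Nginx, it's worth smoke-testing the service locally. The sketch below assumes the Express app above is listening on port 3000; it spoofs the Host header so the service reconstructs the real page URL, the same way Nginx will pass it through later:

```javascript
var http = require('http');

http.get({
    host: 'localhost',
    port: 3000,
    path: '/',
    // Spoof the Host header so req.hostname resolves to the real site
    headers: { 'Host': 'wj.qq.com' }
}, function (res) {
    var body = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
        // Should print fully rendered HTML, not the empty Ajax shell
        console.log(body.slice(0, 300));
    });
});
```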

The bypass service

Now that we have a web service that can pre-render pages, all that remains is to route search engine crawler traffic into the pre-rendering service and return the result to the crawler. Nginx, our access-layer tool, solves this neatly.

```nginx
upstream spider_server {
    server localhost:3000;
}

# Inside your site's existing server block
server {
    listen 80;

    location / {
        proxy_set_header Host            $host:$proxy_port;
        proxy_set_header X-Real-IP       $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # When the UA contains Baiduspider, reverse-proxy to spider_server
        if ($http_user_agent ~* "Baiduspider") {
            proxy_pass http://spider_server;
        }
    }
}
```

This example only handles Baidu's crawler; you can extend the user-agent regular expression yourself to cover all the crawlers you care about.

Free

Having said all that, I suddenly feel this article is rather valuable: there are dedicated server-side pre-rendering services overseas, but they all charge for it. Following the approach outlined in this article, you can deploy a bypass rendering service of your own for free.