Recently, one of the sites I am in charge of had an indexing anomaly, so I am taking advantage of a free weekend to walk through the whole diagnostic process. There were two core problems, one caused by the server architecture and one by the website architecture; this article only covers the indexing anomaly caused by the server architecture.

First, a brief self-introduction. I work for an enterprise in Shenzhen, and before that I was long affiliated with a Party B outsourcing company. As everyone knows, SEO outsourcing companies mostly deal with small-business websites, whose keywords can often be ranked simply by changing the TDK (title, description, keywords).

In addition, the vast majority of small and medium-sized sites have a very simple architecture: an open-source CMS + a single cloud server (or virtual host) + a CDN (and that already takes a company with some operations ability). Given that background, it had never occurred to me that server architecture could also cause problems.

First, how the anomaly was discovered

Figure 1 clearly shows that indexing was normal in mid-to-late March, and that from March 31 to April 25 the indexed count fluctuated abnormally. In other words, something on the site must have gone wrong to cause the indexing anomaly.

![SEO indexing anomaly diagnosis: problems caused by a load-balancing architecture, and solutions (Figure 1)](https://images.lusongsong.com/zb_users/upload/2020/08/202008227744_772.jpg)

I began troubleshooting with the conventional methods. In particular, a few server-log parameters that I had not ruled out are what eventually exposed the problem. The checks were as follows:

1.1. Simulated crawler fetching on the webmaster platform: normal.

1.2. The search engine's crawl volume is growing and trending toward normal. There was an anomaly here: after screening fake spiders out of the crawl data, the real Baidu crawler volume was indeed growing (a reverse-DNS check helps with the screening; see the sketch after this checklist).

1.3. Core keyword rankings fluctuate, but the bias is upward. At present the core keywords sit in the top 5, which is normal.

1.4. Server log analysis: the request_URI values (relative addresses) requested by the crawler looked normal for the moment; see below.

1.5. The server logs are Aliyun logs. A small share of HTTP requests returned 500 server errors on July 18, 19, 20 and 26, but that could account for at most a limited window of indexing anomalies, not a large-scale failure to be indexed.
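
A note on point 1.2: one common way to screen fake spiders out of crawl data is a reverse-DNS check on each claimed crawler IP, since hostnames of genuine Baidu crawlers end in .baidu.com or .baiduspider.com. A minimal sketch in Python (the sample IP is just an illustrative value; feed it IPs pulled from your own access log):

```python
import socket

def is_real_baiduspider(ip: str) -> bool:
    """Reverse-DNS check: genuine Baidu crawler IPs resolve to
    hostnames ending in .baidu.com or .baiduspider.com."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no PTR record at all -> treat as fake
    return hostname.endswith((".baidu.com", ".baiduspider.com"))

# Illustrative value only; substitute IPs taken from the access log.
print(is_real_baiduspider("123.125.66.120"))
```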

When analyzing server access logs, the items generally worth attention are: the crawler's crawl-time values, the page URL values crawled, the order in which the crawler crawls pages, and the crawl volume per time period; whether spider IPs carry different weights is something I am not sure about, so I do not rely on it. A minimal parsing sketch follows these explanations.

Page URL value: the server log usually records a relative address. The mistake in my diagnosis was ignoring the host value; the real crawled URL is the combination host + request_URI.

Page crawl order: this reveals how the crawler moves through the site architecture and roughly in what order it crawls the site's pages, which can serve as a reference when using crawler software or writing your own crawler (Python, PHP, etc.).

Crawl volume per time period: check the proportion of pages crawled in each period against the total crawl volume to judge how popular the site is with the crawler.
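
Here is the minimal parsing sketch promised above. The log format is an assumption on my part: a combined-style access log with the Host value appended as the last field; adjust the regular expression to match your own log_format.

```python
import re
from collections import Counter

# Assumed combined-style log line with the Host value as the last field:
# 1.2.3.4 - - [18/Apr/2020:10:05:32 +0800] "GET /post/1.html HTTP/1.1"
#   200 1234 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; ...)" api.name.com
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)" (?P<host>\S+)'
)

urls = Counter()    # real crawled URLs: host + request_URI
hourly = Counter()  # crawl volume per hour

with open("access.log", encoding="utf-8") as f:
    for line in f:
        m = LINE.match(line)
        if not m or "Baiduspider" not in m["agent"]:
            continue
        urls[m["host"] + m["uri"]] += 1
        hourly[m["time"][:14]] += 1  # "18/Apr/2020:10" -> one bucket per hour

print(urls.most_common(10))
print(sorted(hourly.items()))
```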

Having reached this point, let me explain the site's server architecture:

It uses load balancing: a file server + a data server + a front-end server. All data on the data server is exposed through an API interface, consumed via GET by the front end and the app, and the website's URLs are relative addresses. Naturally, the servers also communicate over the intranet.
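
To make the data flow concrete, here is a purely hypothetical sketch of the front end's fetch-and-render step; the intranet IP, API path, and field names are all invented for illustration:

```python
import requests

# Hypothetical intranet address of the data server's API.
API_INTRANET = "http://10.0.0.5"

# The front end GETs data over the intranet; the API is addressed by
# its secondary domain via the Host header.
resp = requests.get(
    f"{API_INTRANET}/post/1",
    headers={"Host": "api.name.com"},
    timeout=5,
)
post = resp.json()

# The rendered link is a relative address: which full URL it becomes
# depends entirely on the host the page ends up being served under.
html = f'<a href="/post/{post["id"]}.html">{post["title"]}</a>'
```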

Back to the logs: the real URL is host + request_URI, and the Host value I had been ignoring turned out to be the API's secondary domain name (Figure 2).

![SEO indexing anomaly diagnosis: problems caused by a load-balancing architecture, and solutions (Figure 2)](https://images.lusongsong.com/zb_users/upload/2020/08/202008221612_332.jpg)

At this point, you can probably already see why.

Baidu was not crawling the real page URLs; what it actually crawled was API domain name + request_URI. That is, assume the path through which the data server's API renders data to the front end is api.name.com, reached through the internal IP address. The crawled page URL is then api.name.com/post/1.html, while the actual URL, through the external IP address, should be www.name.com/post/1.html.
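
In other words, a relative address only becomes a full URL once a host is attached, so the same page resolves to different URLs depending on which host the crawler reached it through:

```python
from urllib.parse import urljoin

rel = "/post/1.html"  # the relative address stored in the page
print(urljoin("http://api.name.com/", rel))  # what Baidu actually crawled
print(urljoin("http://www.name.com/", rel))  # what it should have crawled
```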

With about 30% of the core issue now understood, the natural next step was to back it up with data, mainly from two points:

1. Review the development log

2. Compare the server logs from before and after April

From point 1, it turns out that on April 13 the proxy for the load-balanced data server API was cancelled. The result is that what the front-end rendering exposes for crawling is data whose Host value is the API domain name: with the proxy gone, the intranet IP is used directly, and the API's secondary domain name becomes the Host value.

From point 2, the Host value in the logs changed from www.name.com to api.name.com around April.
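
Point 2 can be checked mechanically by tallying the Host value per day and looking for the flip; a minimal sketch under the same assumed log format as above (Host recorded as the last field):

```python
from collections import Counter, defaultdict

host_by_day = defaultdict(Counter)
with open("access.log", encoding="utf-8") as f:
    for line in f:
        if "[" not in line:
            continue
        day = line.split("[", 1)[1][:11]   # e.g. "13/Apr/2020"
        host = line.rsplit(None, 1)[-1]    # Host as the last field (assumption)
        host_by_day[day][host] += 1

# Print the dominant hosts per day; the www -> api flip stands out.
for day, hosts in sorted(host_by_day.items()):
    print(day, hosts.most_common(3))
```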

In the end, the problem is that the API host was not put behind a proxy. As long as the API site is rendered through a proxy as a secondary domain of the www site, all is well; if no proxy is used, the page Baidu GETs comes back via the intranet IP address, and the URL it records is api.name.com/post/1.html.

Solutions:

1. Put the load-balanced data server's API interface back behind a proxy

2. Add a canonical tag to the Head area, pointing to the real www URL

3. Use absolute paths in the front-end rendered HTML

4. Develop an API interface that pushes the data (URLs) to the search engine (see the sketch below)
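
For solution 4, here is a hedged sketch against Baidu Webmaster Platform's active push (link submission) endpoint. The site and token values are placeholders: both are issued per site in the platform, and the site parameter must match what is registered there.

```python
import requests

SITE = "https://www.name.com"   # placeholder: the site as registered on the platform
TOKEN = "YOUR_TOKEN"            # placeholder: the per-site push token

# URLs to push, one per line, since the endpoint expects a text/plain body.
urls = "\n".join([
    "https://www.name.com/post/1.html",
    "https://www.name.com/post/2.html",
])

resp = requests.post(
    f"http://data.zz.baidu.com/urls?site={SITE}&token={TOKEN}",
    data=urls.encode("utf-8"),
    headers={"Content-Type": "text/plain"},
    timeout=10,
)
print(resp.json())  # e.g. {"success": 2, "remain": ...} on acceptance
```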

That is the end of this article. As I am only an SEO, my operations ability is limited: I can configure a standalone server and set up a site, but load balancing was something I had only heard about. Please forgive any errors on the operations side.

Source: Lu Songsong's Blog; Author: Shenzhen Legend

Source link: lusongsong.com/reed/13554….