This article is from the Huawei Cloud community post “Python crawler vs. anti-crawler: start with this blog, UA anti-crawling and Cookie parameter anti-crawling,” by: dream eraser.

As you may have noticed, a crawler is a machine accessing the target site, and from the target site’s point of view, crawler traffic is “garbage traffic” that is completely worthless (search engine crawlers being the notable exception).

To block this junk traffic, reduce the load on their servers, and keep crawlers from degrading the experience of normal human users, developers employ a variety of anti-crawling techniques.

Crawling and anti-crawling have a symbiotic relationship: wherever there are crawler engineers, there are anti-crawler engineers, and the two are often locked in an ongoing back-and-forth.

There is no rigid classification of anti-crawling measures; a site that implements anti-crawling code typically combines several techniques at once.

Anti-crawling via server-side validation of request information

This series of blog posts starts with the simplest, entry-level technique: “user-agent” anti-crawling.


The User-Agent (UA) header carries information about the user’s browser. The anti-crawling logic is straightforward: the server inspects the User-Agent value in the request header and uses it to distinguish crawlers from normal browser traffic.

Open any website, bring up the developer tools, and type navigator.userAgent in the console to obtain the UA string.

The UA string format can be understood roughly as:

```
Platform Engine EngineVersion BrowserVersion
```

Broken down in more detail, the format is:

```
BrowserIdentity (OperatingSystem; EncryptionLevel; BrowserLanguage) Engine EngineVersion BrowserVersion
```

With this breakdown in mind, the UA string below is easier to interpret.

```
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36
```

If you test in different browsers, you will notice that every UA string starts with Mozilla, a legacy of the browser wars of the past.

Here is a comparison of the UA strings of three major browsers.

```
# Google Chrome
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36

# Firefox
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0

# IE11
Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; rv:11.0) like Gecko
```

The meaning of the relevant fields in the strings above:

  • Mozilla/5.0: the browser identity;
  • Windows NT 6.1: the operating system; here it indicates Windows 7;
  • Win64 / WOW64: a 64-bit operating system (WOW64 means a 32-bit browser running on 64-bit Windows);
  • x64: the CPU architecture;
  • N, I, U: the encryption level (none, international/weak, USA/strong), not shown in modern UA strings;
  • AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36: there is a lot of interesting history behind this part too, but for our purposes it can simply be read as the engine and browser versions.

With this basic understanding, we can write arbitrary browser identifiers by hand (though most of the time we simply copy one from the developer tools).

Conversely, the server can identify the visiting browser from this string (the operating system information is carried along too, and the server can even verify that the UA field is composed according to the expected rules).
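To make the server side concrete, here is a minimal sketch of how a back end might screen UA strings. The function name and the token list are illustrative assumptions, not any site's real implementation; real sites combine this check with many other signals.

```python
def is_probable_bot(user_agent: str) -> bool:
    """Return True when the UA string looks like an automated client."""
    if not user_agent:
        # A missing UA header is itself suspicious: browsers always send one.
        return True
    ua = user_agent.lower()
    # Libraries such as requests, urllib, curl and wget announce
    # themselves by default unless the crawler overrides the header.
    suspicious = ("python-requests", "python-urllib", "curl", "wget", "scrapy")
    return any(token in ua for token in suspicious)

print(is_probable_bot("python-requests/2.26.0"))  # True
print(is_probable_bot(
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"))  # False
```

This is exactly why setting a browser-like user-agent in the request headers, as shown in the next example, defeats the entry-level check.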

Hands-on example

If you do not set the UA field, the target site returns no data. You can set the user-agent value in the headers below to an empty string and observe the result.

```python
import requests

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
}
res = requests.get('', headers=headers)
print(res.text)
```

Randomizing the user-agent

You can use the Python third-party library fake_useragent (pip install fake-useragent) to generate UA strings, or maintain a pool of UA values yourself. The Host and Referer request headers work the same way as user-agent: they are values the server may check, so setting them is part of the same anti-anti-crawling routine.
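A minimal sketch of maintaining your own UA pool instead of relying on fake_useragent; the pool below reuses the example strings from this article, and the helper name is an assumption for illustration.

```python
import random

# A small hand-maintained pool; in practice you would collect many more.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0",
]

def random_headers() -> dict:
    """Build request headers with a UA picked at random from the pool."""
    return {"user-agent": random.choice(UA_POOL)}

print(random_headers()["user-agent"] in UA_POOL)  # True
```

Rotating the UA per request makes simple string-matching defenses less effective, although it does nothing against the Cookie checks discussed next.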

Cookie anti-crawling

Cookie-based validation is another common anti-crawling technique. Since a suitable target site is not available here, the following content explains it at the theoretical level; later posts will put it into practice against more complex cases.

Cookies as the simplest anti-crawling check

The server validates special Cookie values and returns no data if the submitted Cookie is missing or does not conform to the generation rules.

For example, the server may validate a fixed Cookie field. In the hot-list code above, if you do not carry certain Cookie values you will not get complete data (you can test this yourself; the differentiating value is username).

The server may also verify that the Cookie conforms to a particular format, for example a Cookie dynamically generated by JS following some internal (developer-agreed) rule. Once the Cookie value reaches the back end, the engineer can validate it directly to achieve the anti-crawling effect. Suppose the rule is 123abc123: three random digits, then three random lowercase letters, then three more random digits. The back-end engineer can check the submitted Cookie against this rule with a regular expression and immediately return an error message when it does not match.
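The digit-letter-digit rule above can be sketched as a one-line regular expression check; the pattern and function name are illustrative, matching only the hypothetical rule from this article.

```python
import re

# Three digits, three lowercase letters, three digits (e.g. "123abc123").
COOKIE_RULE = re.compile(r"\d{3}[a-z]{3}\d{3}")

def cookie_matches_rule(value: str) -> bool:
    """Server-side check: does the submitted Cookie follow the agreed rule?"""
    return COOKIE_RULE.fullmatch(value) is not None

print(cookie_matches_rule("123abc123"))  # True
print(cookie_matches_rule("123ABC123"))  # False: letters must be lowercase
print(cookie_matches_rule("12abc123"))   # False: only two leading digits
```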

Of course, this scheme is easy to identify, so a timestamp can be added as a further step. The back-end engineer extracts the timestamp from the Cookie and checks how far it differs from the current time; if the difference exceeds some threshold, the Cookie is treated as forged.
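The timestamp check can be sketched as follows; the 300-second threshold is an arbitrary illustrative value, since real limits are site-specific.

```python
import time

# Illustrative threshold: cookies older than this are rejected.
MAX_AGE_SECONDS = 300

def cookie_timestamp_fresh(ts, now=None):
    """Reject cookies whose embedded timestamp is too far from server time."""
    if now is None:
        now = time.time()
    return abs(now - ts) <= MAX_AGE_SECONDS

print(cookie_timestamp_fresh(time.time()))         # True: just issued
print(cookie_timestamp_fresh(time.time() - 3600))  # False: an hour old
```

A crawler that replays a captured Cookie will fail this check as soon as the timestamp goes stale, which forces it to regenerate the value the same way the site's JS does.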

Cookies are also used for user identity verification. For example, many sites only serve data after login, because the Cookie records the user’s identity. Cookies have many such application scenarios, one being the system-message page of the Huawei Cloud blog…

Visiting that page without logging in redirects you to the login page, but if you carry the Cookie in the request header you get the target content. The most important Cookie field here is HWS_ID. The test code is below; you can copy the corresponding Cookie value from the developer tools and access the page.

```python
import requests
from lxml import etree

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36",
    "cookie": 'your HWS_ID cookie value; '
}
# allow_redirects=False keeps the raw response instead of following
# the redirect to the login page.
res = requests.get('', headers=headers, allow_redirects=False)

with open("./1.html", "w", encoding="utf-8") as f:
    f.write(res.text)

elements = etree.HTML(res.text)
print(elements.xpath("//title/text()"))
```
