Learning goals
  1. Understand why servers need anti-crawling
  2. Know what kinds of crawlers servers most often have to deal with
  3. Understand some common concepts in the anti-crawler field
  4. Know the three directions of anti-crawling
  5. Learn about common anti-crawling based on identification
  6. Learn about common anti-crawling based on crawler behavior
  7. Understand common anti-crawling based on data encryption

1 Reasons for server-side anti-crawling

  • Crawlers account for a high proportion of total PV (PV is the number of page views; every time a page is opened or refreshed counts as one PV), which wastes money (especially the crawlers in March).

    What is the "March crawler"? Every March there is a crawler peak: large numbers of graduate students choose to crawl websites and run opinion analysis for their theses. The thesis is due in May, and everyone has spent the earlier months on DotA and LOL, so when March arrives it is almost too late: hurry to grab the data, analyze it in April, submit the paper in May. That is the rhythm.

  • Resources the company offers for free querying get scraped in bulk, which erodes competitiveness and therefore revenue.

    The data can be queried directly without logging in. If login were forced, accounts could be blocked to make the other side pay a price, which is what many websites do, but we would rather not force users to log in. Without anti-crawling, the other party can copy the information in bulk and the company's competitiveness drops sharply. Once competitors hold the data, over time users learn they only need to go to the competitor and have no reason to come to our site, which is bad for us.

  • Suing a crawler is a long shot

    Crawling is still a legal gray area in China: a lawsuit may succeed, or it may go nowhere. So technical means are still needed as the final line of defense.

2 What kinds of crawlers servers often have to deal with

  • Very unsophisticated fresh graduates

    Fresh graduates' crawlers are often simple and crude: they ignore server load, and since their numbers are unpredictable, they can easily bring a site down.

  • Very unsophisticated start-ups

    There are more and more start-ups now, and who knows who talked them into it. People start a business without knowing what to do, see that big data is hot, and decide to do big data. The analysis program is almost finished when they realize they have no data. What to do? Write a crawler. So countless crawlers appear, crawling data for the sake of survival.

  • Out-of-control crawlers written wrong by accident that nobody stopped

    Some sites have long had anti-crawling in place, yet a crawler keeps hammering them tirelessly. What does that mean? It means the crawler cannot get any data at all: apart from an httpCode of 200, everything is wrong, and still the crawler does not stop. This is probably some crawler hosted on a server somewhere, unclaimed, still working hard.

  • Well-organized business rivals

    These are the biggest rivals: they have the technology and the money to get whatever they want, and if they decide to fight you, you have no choice but to fight back.

  • Search engines gone haywire

    Do not assume search engines are always well behaved; they too can go haywire, and when they do, the drop in server performance and the volume of requests are no different from a network attack.

3 Some common concepts in the anti-crawler field

Since anti-crawling is still a relatively new field, some of the definitions have to be made by ourselves:

  • Crawler: obtaining website information in bulk using any technical means. The key word is bulk.

  • Anti-crawler: using any technical means to prevent others from obtaining your site's information in bulk. The key word is again bulk.

  • False positive: in the course of anti-crawling, an ordinary user is misidentified as a crawler. An anti-crawler strategy with a high false-positive rate cannot be used, no matter how effective it is.

  • Interception: successfully blocking crawler access; there is a corresponding interception rate. Generally, the higher an anti-crawler strategy's interception rate, the higher the chance of false positives, so there is a trade-off.

  • Resources: the sum of machine costs and labor costs.

Bear in mind that labor cost is also a resource, and it matters more than machines: by Moore's Law, machines keep getting cheaper, while, by the trend of the IT industry, programmers keep getting more expensive. So making the other side's crawler engineers work overtime is often the real win; burning machine cost is not what hurts them.

4 The three directions of anti-crawling

  • Anti-crawling based on identification

  • Anti-crawling based on crawler behavior

  • Anti-crawling based on data encryption

5 Common anti-crawling based on identification

1 Anti-crawling through headers fields

Headers contain many fields, and the server on the other side can use them to judge whether the request comes from a crawler

1.1 Anti-crawling through the User-Agent field in headers

  • Anti-crawling principle: by default a crawler carries no User-Agent and uses the module's default settings
  • Solution: add a User-Agent before sending the request; a better approach is to use a User-Agent pool (collect a batch of User-Agents, or generate them randomly), as sketched below
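A minimal sketch of a User-Agent pool using the requests library: each request picks a random UA string so the library's default never appears. The UA strings and the URL are placeholders, not from the original text.

```python
# Pick a random User-Agent for each request instead of the library default.
import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

headers = {"User-Agent": random.choice(USER_AGENTS)}
resp = requests.get("https://example.com/", headers=headers, timeout=10)  # placeholder URL
print(resp.status_code)
```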

1.2 Anti-crawling through the referer field or other fields

  • Anti-crawling principle: by default a crawler does not carry the referer field; the server can judge whether a request is legitimate by checking where it came from
  • Solution: add the referer field, as in the sketch below
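A minimal sketch of adding a Referer header so the request looks like it came from a page on the target site; both URLs are placeholders.

```python
# Send a Referer alongside the User-Agent so the request has a plausible origin.
import requests

headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://example.com/list",  # the page the request supposedly came from
}
resp = requests.get("https://example.com/detail/1", headers=headers, timeout=10)
print(resp.status_code)
```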

1.3 Anti-crawling through cookies

  • Anti-crawling principle: the server checks the cookies to see whether the user who sent the request has the required permissions
  • Solution: perform a simulated login and crawl the data once the cookies have been obtained, as sketched below
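A minimal sketch of a simulated login with requests.Session: the session stores the cookies returned by the login request and sends them with later requests. The login URL and form field names are made up for illustration.

```python
# Log in once, then reuse the session so its cookies accompany every later request.
import requests

session = requests.Session()

login_payload = {"username": "user", "password": "pass"}        # hypothetical credentials
session.post("https://example.com/login", data=login_payload)   # hypothetical login endpoint

# Cookies obtained during login are now attached automatically.
resp = session.get("https://example.com/member/data")
print(resp.status_code)
```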

2 Anti-crawling through request parameters

Request parameters can be obtained in many ways. A request sent to the server often has to carry parameters, so the server can usually judge whether the client is a crawler by checking whether those parameters are correct

2.1 Getting request data from a static HTML page (the GitHub login data)

  • Anti-crawling principle: make it harder to obtain the request parameters
  • Solution: analyze each captured packet and work out how the requests relate to each other; hidden form fields are a typical case, as sketched below
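A minimal sketch of pulling a hidden form parameter out of a static HTML page, assuming (as on the GitHub login form) a hidden input named authenticity_token that must be sent back with the login POST.

```python
# Fetch the login page and extract a hidden token embedded in the form.
import requests
from lxml import etree

session = requests.Session()
resp = session.get("https://github.com/login", headers={"User-Agent": "Mozilla/5.0"})
html = etree.HTML(resp.text)

# The token sits in <input type="hidden" name="authenticity_token" value="...">
token = html.xpath('//input[@name="authenticity_token"]/@value')
print(token[0] if token else "token not found")
```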

2.2 Obtaining request data by sending a request

  • Anti-crawling principle: make it harder to obtain the request parameters
  • Solution: analyze each captured packet, work out how the requests relate to each other, and trace where each request parameter comes from, as sketched below
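A minimal sketch of chaining requests: one request returns a parameter (here a hypothetical "token" field in a JSON response) that the next request must carry. Both URLs and field names are invented for illustration.

```python
# First request hands out the parameter, second request carries it.
import requests

session = requests.Session()

# Step 1: the preparatory request that returns the parameter
token = session.get("https://example.com/api/token").json()["token"]  # hypothetical endpoint

# Step 2: the real data request, carrying the parameter obtained above
data = session.get("https://example.com/api/list", params={"token": token}).json()
print(data)
```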

2.3 Generating request parameters with JS

  • Anti-crawling principle: the request parameters are generated by JS
  • Solution: analyze the JS and observe how the encryption is implemented, then either obtain the result of executing the JS through js2py (see the sketch below) or drive a real browser with Selenium
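A minimal sketch of executing page JS with js2py to reproduce a signed request parameter. The sign() function below is a stand-in for whatever routine the target page really uses.

```python
# Execute the site's JS in Python and call its signing function directly.
import js2py

js_code = """
function sign(ts, keyword) {
    // placeholder for the page's real signing logic
    return ts + '_' + keyword.split('').reverse().join('');
}
"""

context = js2py.EvalJs()
context.execute(js_code)

# Call the JS function from Python and use the result as a request parameter.
print(context.sign("1700000000", "python"))
```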

2.4 Anti-crawling through verification codes

  • Anti-crawling principle: the server displays a verification code to force verification of the user's browsing behavior
  • Solution: use a captcha-solving platform or machine-learning methods to recognize the captcha; the solving platforms are cheap and easy to use, so they are the more recommended option

6 Common anti-crawling based on crawler behavior

1 Anti-crawling based on request frequency or total number of requests

A crawler's behavior is obviously different from an ordinary user's: its request frequency and request count are far higher

1.1 Anti-crawling based on the total number of requests per IP/account per unit time

  • Anti-crawling principle: a normal browser does not request a website very fast, so the same IP/account sending a large number of requests to the server is very likely to be identified as a crawler
  • Solution: buy high-quality IPs / multiple accounts to spread the requests, as sketched below
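A minimal sketch of rotating proxies with requests so the request volume is spread across several IPs; the proxy addresses and URL are placeholders.

```python
# Route each request through a randomly chosen proxy from a small pool.
import random
import requests

proxy_pool = [
    "http://10.10.1.10:3128",   # placeholder proxies; substitute purchased ones
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]

def fetch(url):
    proxy = random.choice(proxy_pool)
    # Use the chosen proxy for both http and https requests.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

resp = fetch("https://example.com/")
print(resp.status_code)
```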

1.2 Anti-crawling based on the interval between requests from the same IP/account

  • Anti-crawling principle: when a person browses a website, the intervals between requests are random, while the interval between a crawler's consecutive requests is usually fixed and short, so it can be used for anti-crawling
  • Solution: wait a random time between requests to simulate a real user; since the added interval slows things down, use a proxy pool to keep the overall speed up; if the requests are made from accounts, add random sleeps between account requests as well (see the sketch below)
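A minimal sketch of adding a random pause between requests so the interval pattern looks less machine-like; the URLs and the delay range are arbitrary choices.

```python
# Sleep a random amount of time after every request.
import random
import time
import requests

urls = ["https://example.com/page/%d" % i for i in range(1, 6)]  # placeholder URLs

for url in urls:
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    # Wait between 1 and 3 seconds; tune the range to the target site.
    time.sleep(random.uniform(1, 3))
```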

1.3 Anti-crawling by setting a threshold on the number of requests per IP/account per day

  • Anti-crawling principle: normal browsing produces a limited number of requests per day; once a certain value is exceeded, the server refuses to respond
  • Solution: buy high-quality IPs / multiple accounts and also add random sleeps between requests

2 Anti-crawling based on the crawling pattern, usually analyzed at the level of crawling steps

2.1 Anti-crawling through JS redirects

  • Anti-crawling principle: the next page is reached through a JS jump, so its URL cannot be found in the source code
  • Solution: capture packets repeatedly to obtain the jump URLs and work out the pattern

2.2 Obtaining crawler IPs (or proxy IPs) through a honeypot (trap) for anti-crawling

  • Anti-crawling principle: while a crawler collects links to request, it extracts subsequent links with regular expressions, XPath, CSS selectors and so on. The server can plant a trap URL that these extraction rules will pick up but that a normal user would never visit, which effectively separates crawlers from normal users
  • Solution: after writing the crawler, run a batch crawl test through proxies / carefully analyze the structure of the response and locate the traps in the page, for example by filtering out hidden links as sketched below
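A minimal sketch of skipping honeypot links while extracting URLs: links hidden with display:none or visibility:hidden (one common way traps are planted) are dropped. Real traps may be hidden differently, so treat this as an illustration only; the URL is a placeholder.

```python
# Extract links but drop any that are hidden from human visitors.
import requests
from lxml import etree

resp = requests.get("https://example.com/list", timeout=10)  # placeholder URL
html = etree.HTML(resp.text)

links = []
for a in html.xpath("//a[@href]"):
    style = (a.get("style") or "").replace(" ", "").lower()
    if "display:none" in style or "visibility:hidden" in style:
        continue  # likely a trap link a human would never see or click
    links.append(a.get("href"))

print(links)
```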

2.3 Anti-crawling through fake data

  • Anti-crawling principle: fake data that normal users never see is mixed into the returned response to pollute the crawler's database
  • Solution: during long runs, check that the data in the database matches the data on the actual pages; if something is off, analyze the response content carefully

2.4 Clogging the task queue

  • Anti-crawling principle: generate large numbers of garbage URLs to clog the task queue and reduce the crawler's real working efficiency
  • Solution: observe the running process and the request/response statuses / carefully analyze the source code to find the rule by which the garbage URLs are generated, then filter the URLs, as sketched below
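A minimal sketch of filtering a task queue against an allowlist pattern so garbage URLs never get enqueued; the pattern and URLs are made up for illustration.

```python
# Only enqueue URLs that match the pattern of pages worth crawling.
import re

# Assume only detail pages of the form /item/<digits> are wanted.
ALLOW = re.compile(r"^https://example\.com/item/\d+$")

candidates = [
    "https://example.com/item/1001",
    "https://example.com/item/1002",
    "https://example.com/r/xxxxxxxx",   # garbage URL planted by the site
]

queue = [url for url in candidates if ALLOW.match(url)]
print(queue)
```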

2.5 Clogging network IO

  • Anti-crawling principle: sending a request and receiving the response is essentially a download; if the URL of a very large file is mixed into the task queue, requesting it ties up network IO, and in a multithreaded crawler it also ties up a thread
  • Solution: observe the crawler's running state / when multithreading, time the request threads / check the URL (for example its response size, as sketched below) before sending the request
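A minimal sketch of checking Content-Length with a HEAD request before downloading, so oversized files planted in the queue are skipped. The size limit and URL are arbitrary.

```python
# Skip any URL whose declared size exceeds a limit instead of downloading it.
import requests

MAX_BYTES = 5 * 1024 * 1024  # skip anything larger than 5 MB

def fetch_if_small(url):
    head = requests.head(url, allow_redirects=True, timeout=10)
    size = int(head.headers.get("Content-Length", 0))
    if size > MAX_BYTES:
        return None  # likely a large-file trap; do not download it
    return requests.get(url, timeout=10)

resp = fetch_if_small("https://example.com/item/1001")  # placeholder URL
print(resp.status_code if resp else "skipped")
```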

2.6 Comprehensive control through an operations platform

  • Anti-crawling principle: the site is managed comprehensively through an operations platform, usually with a composite anti-crawler strategy that applies several measures at the same time
  • Solution: observe and analyze carefully, run tests against the target site over a long period, monitor the data-collection speed, and handle the measures one by one

7 Common anti-crawling based on data encryption

1 Specializing the data contained in the response

The usual specializations are CSS data offsets / custom fonts / data encryption / data rendered as images / special encoding formats, etc.

1.1 Anti-crawling through custom font files

The following image is from the desktop version of Maoyan Movies (猫眼电影)

  • Anti-crawling principle: the site uses its own font file, so the characters in the page source do not decode to the real values
  • Solution: switch to the mobile version / parse the font file to translate the characters, as sketched below
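A minimal sketch of inspecting a site's font file with fontTools; the file name and the assumption that glyph names encode the real characters are illustrative, since each site's font works differently.

```python
# Read the downloaded font file and dump its code-point -> glyph-name mapping.
from fontTools.ttLib import TTFont

font = TTFont("site_font.woff")          # font file downloaded from the page
cmap = font["cmap"].getBestCmap()        # unicode code point -> glyph name

# Inspect the mapping; glyph names (e.g. "uniE343") then have to be matched
# against reference glyphs to recover the actual characters.
for codepoint, glyph_name in cmap.items():
    print(hex(codepoint), glyph_name)
```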

1.2 Anti-crawling through CSS offsets

The following image is from the desktop version of Qunar (去哪儿)

  • Anti-crawling principle: the data in the source code is not the real data; the real data has to be reconstructed by applying the CSS offsets
  • Solution: compute the CSS offsets

1.3 Anti-crawling through data dynamically generated by JS

  • Anti-crawling principle: the data is generated dynamically by JS
  • Solution: analyze the key JS, work out how the data is generated, and simulate that generation, or render the page in a real browser with Selenium, as sketched below
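A minimal sketch of rendering a JS-generated page with Selenium so the data appears in the DOM; it assumes Chrome with a matching driver (or Selenium Manager) is available, and the URL is a placeholder.

```python
# Let a headless browser execute the page's JS, then read the resulting DOM.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/")      # placeholder URL
    html = driver.page_source               # DOM after JS has executed
    print(len(html))
finally:
    driver.quit()
```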

1.4 Anti-crawling through data rendered as images

  • Anti-crawling principle: the data is displayed as images rather than text, e.g. [58同城 short-term rental](baise.58.com/duanzu/3801…)
  • Solution: use an image-parsing (OCR) engine to extract the data from the images, as sketched below
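A minimal sketch of running OCR on a downloaded data image with pytesseract; it assumes the tesseract binary and Pillow are installed, and the image URL is a placeholder.

```python
# Download the image and hand it to tesseract for text recognition.
import io

import requests
from PIL import Image
import pytesseract

resp = requests.get("https://example.com/price.png", timeout=10)  # placeholder image URL
image = Image.open(io.BytesIO(resp.content))

text = pytesseract.image_to_string(image)
print(text)
```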

1.5 Anti-crawling through the encoding format

  • Anti-crawling principle: the response does not use the usual encoding format; the crawler typically decodes it as UTF-8, and the decoding produces garbled text or an error
  • Solution: try decoding with several formats based on clues in the source, or detect the real encoding, as sketched below
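A minimal sketch of detecting the real encoding instead of assuming UTF-8: requests' apparent_encoding runs charset detection on the response body. The URL is a placeholder.

```python
# Compare the declared encoding with the detected one, then re-decode.
import requests

resp = requests.get("https://example.com/", timeout=10)  # placeholder URL
print("declared:", resp.encoding, "detected:", resp.apparent_encoding)

# Re-decode the raw bytes with the detected encoding.
resp.encoding = resp.apparent_encoding
print(resp.text[:200])
```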

Summary

  • Master the common anti-crawling measures, their principles, and how to deal with them