Takeaway: The Maginot Line was the chain of fortifications France spent more than ten years building before World War II. It was extremely strong, but because it was so expensive it covered only the few hundred kilometers of the Franco-German border. France judged the rugged Ardennes impassable to an army, and trusted the solid defenses along the Belgian border, so it left that flank lightly guarded, counting on the mighty Maginot Line to hold off the Germans. The Germans, however, avoided the Franco-German frontier entirely, bypassed the Maginot Line through the Ardennes, and drove the British and French to the evacuation at Dunkirk.

Captchas are the Maginot Line defending against a crawler's frontal assault.

As the struggle between crawlers and anti-crawler systems escalates around captchas, captchas keep getting harder to recognize. A complex captcha now looks like this:

Attacking such a captcha head-on and recognizing it is a genuinely complex undertaking. It involves image-processing techniques, binarization, noise reduction, and character segmentation, plus recognition algorithms such as KNN (k-nearest neighbors) and SVM (support vector machines); harder cases call for a CNN (convolutional neural network) and other machine learning machinery.
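
To make the head-on pipeline concrete, here is a minimal preprocessing sketch in Python using Pillow. The threshold value and the isolated-pixel rule are illustrative assumptions, not parameters taken from any site discussed here:

```python
from PIL import Image

def preprocess(path, threshold=140):
    """Binarize a captcha image and strip isolated noise pixels."""
    img = Image.open(path).convert("L")                      # grayscale
    img = img.point(lambda p: 0 if p < threshold else 255)   # binarize
    px = img.load()
    w, h = img.size
    for x in range(1, w - 1):
        for y in range(1, h - 1):
            if px[x, y] == 0:  # foreground pixel
                neighbors = sum(
                    px[x + dx, y + dy] == 0
                    for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                    if (dx, dy) != (0, 0)
                )
                if neighbors <= 1:       # isolated speck: treat as noise
                    px[x, y] = 255
    return img
```

After this, the cleaned image would still need to be segmented into characters and fed to a classifier (KNN, SVM, or a CNN).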

There are captcha-solving platforms today (paid services that return the answer for you) that can crack the vast majority of captchas, but relying on them alone is not feasible when the crawl volume is very large. Besides the cost, there are captchas the platforms simply cannot handle, such as the sliding captcha:

If a frontal attack takes that much work, can we crawler engineers find our own Ardennes heights, a route the defenders assumed impassable, and bypass the captcha altogether?

This article takes the industrial and commercial websites as its example, gives a short summary of common captcha-bypass techniques, and along the way reveals how to defeat a sliding captcha without simulating a JS drag.

The provincial industrial and commercial websites (full name: National Enterprise Credit Information Publicity System) hold a large amount of real data on enterprises, financing, loans, credit records, and the like, and naturally draw a good share of crawler firepower, so their anti-crawler measures are unusually strict. An ordinary website shows a captcha only at login or registration, or after frequent visits; the industrial and commercial sites require no login to query, yet demand a captcha for every search keyword. On top of that, each province developed its site independently and uses its own captcha mechanism, which adds still more obstacles to a nationwide crawl. That makes the captchas of the industrial and commercial websites especially representative.

First, let's start from the simplest angle: paging.

Paging can be handled on the front end or on the back end. If it is handled purely on the back end, every click on the next page sends a fresh query request, and since a captcha can normally serve only one request, a new request should require a new captcha. So simply observe whether turning the page pops up a captcha input box: that tells you whether the captcha can be bypassed.

How do you tell whether paging happens on the front end or the back end? Press F12 to open the browser's developer tools and click the next page: if a new request shows up, paging is on the back end; if not, it is on the front end.

In practice, the industrial and commercial websites of Sichuan and Shanghai both page on the back end, and neither shows a captcha input box when turning pages. So the captcha can be bypassed on both, though the underlying reasons differ:

  • Comparing the request parameters before and after turning a page on the Sichuan site, besides the page-number parameter there is one extra item: yzmYesOrNo=no. As the variable name suggests, the back end uses this parameter to decide whether to check the captcha (see the sketch after this list).

  • The same comparison on the Shanghai site shows that, apart from the page parameter, the page-turn request simply lacks the captcha fields. So we can make a bold guess: the captcha is checked only on the front end, and the back end does no second check. The move, then, is to skip the page entirely and send the request straight to the back end without any captcha fields; the data comes back all the same.
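
Here is a minimal sketch of both bypasses using the requests library. Apart from yzmYesOrNo, every URL and field name below is a placeholder, since the captured requests themselves are not reproduced here:

```python
import requests

session = requests.Session()

# Sichuan-style bypass: tell the back end not to check the captcha.
# Only yzmYesOrNo comes from the capture; the rest is hypothetical.
resp = session.post(
    "http://example-gsxt.gov.cn/search",            # placeholder URL
    data={"keyword": "some company", "pageNo": 2, "yzmYesOrNo": "no"},
)

# Shanghai-style bypass: simply omit the captcha fields; the back end
# never re-checks what the front end was supposed to validate.
resp = session.post(
    "http://example-gsxt.gov.cn/search",            # placeholder URL
    data={"keyword": "some company", "pageNo": 2},  # no captcha field
)
print(resp.status_code)
```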

Strictly speaking, this is a vulnerability, and the right attitude for a crawler engineer toward such a vulnerability is: sneak into the village quietly, and no shooting. Don't push too hard, either. I set no rate limit during my experiment, crawled a hundred thousand records, and got the IP banned; it took a long time for the ban to lift, and in the meantime the page was revamped and the hole was patched.

Secondly, observe whether the target website uses more than one kind of captcha.



Some sites, for reasons unknown, use different captchas on different pages. When that is the case, we can take the soft option: sidestep the complex captcha, crack the simple one instead, and submit the recognized answer as a parameter to the back end.

For example, the query page of the Hubei site uses a nine-grid captcha similar to this:

while the login page of its electronic business license system uses a captcha like this:

The latter is obviously far easier to recognize.
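
A rough sketch of this soft option, assuming the simpler captcha is plain distorted text that an OCR engine like Tesseract can read, and that the query back end accepts its answer; all URLs and field names are placeholders:

```python
from io import BytesIO

import pytesseract
import requests
from PIL import Image

session = requests.Session()

# Fetch the *simpler* captcha from the easier page (placeholder URL).
img_bytes = session.get("http://example-gsxt.gov.cn/easy_captcha.jpg").content
code = pytesseract.image_to_string(Image.open(BytesIO(img_bytes))).strip()

# Submit the recognized text with the query; field names are assumptions.
resp = session.post(
    "http://example-gsxt.gov.cn/search",
    data={"keyword": "some company", "captcha": code},
)
print(resp.status_code)
```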

Thirdly, consider the IDs under which the data is stored.

WAP pages built specifically for mobile are generally far less restricted. On the Beijing site, sending requests to the WAP interface with no captcha parameter at all returns data; the principle is the same as on the Shanghai site. But it enforces a strict daily visit limit per IP, so the approach works, just not well.
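
A sketch of the WAP approach, throttled to stay under the per-IP daily quota; the endpoint and parameters are placeholders:

```python
import time

import requests

session = requests.Session()

for kw in ["company A", "company B"]:
    # The WAP endpoint takes no captcha parameter at all (placeholder URL).
    resp = session.get(
        "http://example-gsxt.gov.cn/wap/search", params={"keyword": kw}
    )
    print(kw, resp.status_code)
    time.sleep(5)  # throttle: the site enforces a strict per-IP daily limit
```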

Keep observing. Take a search for "Landscape Group" as an example: going from the search page to the list page requires a captcha, but going from the list page to a detail page does not. The final detail page looks like this:

Hardly anyone remembers the address of an industrial and commercial website; we reach it through a search engine. Searching for Beijing enterprise credit information turns up two valuable results: besides the National Enterprise Credit Information Publicity System, there is also a Beijing Enterprise Credit Information Network. Going to the latter and performing the following steps also reaches a detail page:

Observe that the enterprise ID is identical on both sites. Well, well: "two brands, one team"! The latter site is smaller; it has a captcha too, but one that can be bypassed the same way as Shanghai's. Its result fields, though, don't quite meet our needs. No matter: we can fetch the enterprise ID from the credit information network, then graft it onto the publicity system by constructing the detail-page link there, and obtain the full detail page!

Anyone who has used a database knows that record IDs auto-increment by default. So if a site exposes data as xxx.com?id=1234567, it's easy to guess that an ID like 1234568 can be constructed! Sites whose IDs can be guessed this easily are not common, but they do exist: the Gansu site, for instance, keys its result pages on the enterprise registration number.
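
Both ID tricks, grafting an ID harvested from one site into another site's URL pattern and enumerating sequential IDs, come down to constructing URLs. A sketch with placeholder URL patterns:

```python
import requests

session = requests.Session()

# Grafting: an enterprise ID scraped from the lightly defended site,
# plugged into the publicity system's detail-page URL pattern.
ent_id = "1234567"
detail = session.get("http://example-gsxt.gov.cn/detail", params={"id": ent_id})

# Enumeration: when IDs auto-increment, neighboring records are guessable.
for guess in range(1234567, 1234572):
    resp = session.get("http://example.com/detail", params={"id": guess})
    if resp.ok:
        print(guess, len(resp.text))  # a crude signal of a real record
```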

Finally, let's talk about the sliding captcha.

At present the industrial and commercial websites have undergone a sweeping overhaul and all use sliding captchas, so most of the ideas above no longer apply. For sliding captchas, essentially every solution you can find online does the same thing: download the image, reassemble it, compute the sliding distance, then simulate a JS drag. Let's see whether the problem can be solved without simulating the drag.

Take the Yunnan site as an example and start by capturing the traffic to see the flow.

1. http://yn.gsxt.gov.cn/notice/pc-geetest/register?t=147991678609, the response:

2. Download the verification code image

3. http://yn.gsxt.gov.cn/notice/pc-geetest/validate, post the following data:

4. http://yn.gsxt.gov.cn/notice/search/ent_info_list, post the following data:


On closer analysis, we found two suspicious points:


  1. The first step does not return the address of the image to download, so how does the front end know which image to download?

  2. In the third step, the backend is not told which images were downloaded. How can the backend verify that the posted data is valid?


Reading the obfuscated front-end JS code carefully, we find that the front end processes the data like this:

  1. Take a random integer from 0 to 6 (inclusive) and assign it to d; here d=5;

  2. Take a random integer from 0 to 300 (exclusive) and assign it to e; here e=293;

  3. Convert d to a string and hash it with MD5; the first 9 characters of the digest are assigned to f: f='e4da3b7fb';

  4. Convert e to a string and hash it with MD5; the 9 characters of the digest starting at the 11th position are assigned to g: g='43be4f209';

  5. Interleave the even-position characters of f with the odd-position characters of g to form a new 9-character string h: h='e3de3f70b';

  6. Take the last 4 characters of h as a number mod 200; if the result is less than 40, take 40, otherwise keep it; assign it to x: x=51;

  7. Take a random integer in [x-3, x+3] and assign it to c: c=51;

  8. Join t(c, challenge), t(d, challenge), and t(e, challenge) with underscores ('_') to form geetest_validate: '9ccccc997288_999c9ccaa83_999cc9c9999990d';

  9. The f and g parameters determine which image to download:

And x is the horizontal offset by which the slider must be dragged, which answers question 1: how the front end knows the image's download address.
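
These steps can be reproduced in a few lines of Python. This is a reconstruction from the description above; in particular, reading "the last 4 characters of h" as a hexadecimal number is our interpretation of the obfuscated JS:

```python
import hashlib
import random

def md5(s):
    return hashlib.md5(s.encode()).hexdigest()

def slider_params():
    """Reproduce steps 1-7 above; returns (x, c, d, e)."""
    d = random.randint(0, 6)      # step 1: random int in [0, 6]
    e = random.randrange(300)     # step 2: random int in [0, 300)
    f = md5(str(d))[:9]           # step 3: first 9 chars of the digest
    g = md5(str(e))[10:19]        # step 4: 9 chars starting at index 10
    # step 5: even-position chars of f interleaved with odd-position chars of g
    h = "".join(f[i] if i % 2 == 0 else g[i] for i in range(9))
    # step 6: last 4 chars of h read as hex, mod 200, floored at 40
    x = max(int(h[-4:], 16) % 200, 40)
    c = random.randint(x - 3, x + 3)  # step 7: the offset actually submitted
    return x, c, d, e
```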

The t encryption process mentioned above looks like this:

t(a, b), illustrated here with a=51 and b=challenge:

  1. challenge is a 34-character hex string; the first 32 characters are assigned to prefix and the last 2 to suffix: prefix='34173cb38f07f89ddbebc2ac9128303f', suffix='a8';

  2. Deduplicate prefix while preserving the original order, giving the list ['3', '4', '1', '7', 'c', 'b', '8', 'f', '0', '9', 'd', 'e', '2', 'a'];

  3. Deal that list round-robin into 5 sublists, giving random_key_list: [['3', 'b', 'd'], ['4', '8', 'e'], ['1', 'f', '2'], ['7', '0', 'a'], ['c', '9']];

  4. Convert suffix character by character from hex to decimal, giving [10, 8];

  5. Multiply the list from step 4 elementwise by [36, 1], sum, and add the rounded value of a: n = 51 + 36*10 + 1*8 = 419;

  6. Let q = [1, 2, 5, 10, 50] and greedily decompose n over q (n = 50*8 + 10*1 + 5*1 + 2*2 + 1*0); the coefficients in reverse order are assigned to p: p = [0, 2, 1, 1, 8];

  7. Walking from the largest denomination down, pick p[i] random characters from random_key_list[i] and concatenate them, giving sub_key: sub_key='9ccccc997288'.
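
Putting the steps together, here is a Python reconstruction of t; it implements the greedy change-making loop described above (variable names are our own):

```python
import random

def t(a, challenge):
    """Reconstruction of the front end's t(a, challenge) routine."""
    prefix, suffix = challenge[:32], challenge[32:]

    # Dedupe prefix preserving order, then deal it round-robin into 5 sublists.
    seen, uniq = set(), []
    for ch in prefix:
        if ch not in seen:
            seen.add(ch)
            uniq.append(ch)
    random_key_list = [uniq[i::5] for i in range(5)]

    # n = round(a) + 36 * hex(suffix[0]) + hex(suffix[1])
    digits = [int(ch, 16) for ch in suffix]
    n = round(a) + 36 * digits[0] + digits[1]

    # Greedily break n into denominations q, emitting one random character
    # from the matching sublist per unit of each denomination.
    q = [1, 2, 5, 10, 50]
    out, i = [], 4
    while n > 0:
        if n - q[i] >= 0:
            out.append(random.choice(random_key_list[i]))
            n -= q[i]
        else:
            i -= 1
    return "".join(out)

challenge = "34173cb38f07f89ddbebc2ac9128303fa8"
print(t(51, challenge))  # 12 random-looking chars, e.g. '9ccccc997288'
```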

With that analysis done, we notice two more things:

  1. Given the capture above, there is no need to actually download the image: performing steps 1, 3 and 4 is enough to get the target data. Step 1 could be dropped as well, leaving only 3 and 4, but without step 1 we would need an extra request to obtain the cookie, so we keep step 1 anyway; it also makes the disguise a little more convincing.

  2. The same challenge yields a different validate and seccode on every run, so how does the server verify all that other data from the challenge alone?
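
To close the loop, here is a sketch of the trimmed flow (steps 1, 3 and 4 of the capture), reusing slider_params and t from the sketches above. The geetest_* field names, the JSON shape of the register response, and the seccode format are assumptions based on common Geetest parameters, not values confirmed here:

```python
import time

import requests

session = requests.Session()
base = "http://yn.gsxt.gov.cn/notice"

# Step 1: register. Kept mainly for the cookie; the response (assumed JSON)
# carries the 34-character challenge.
reg = session.get(f"{base}/pc-geetest/register",
                  params={"t": int(time.time() * 1000)}).json()
challenge = reg["challenge"]

# Compute the offset and the validate string without touching any image.
x, c, d, e = slider_params()
validate = "_".join([t(c, challenge), t(d, challenge), t(e, challenge)])
seccode = validate + "|jordan"  # assumed offline-mode seccode format

# Step 3: post the validate.
session.post(f"{base}/pc-geetest/validate",
             data={"geetest_challenge": challenge,
                   "geetest_validate": validate,
                   "geetest_seccode": seccode})

# Step 4: the actual search request (search field name is a placeholder).
resp = session.post(f"{base}/search/ent_info_list",
                    data={"searchword": "Landscape Group",
                          "geetest_challenge": challenge,
                          "geetest_validate": validate,
                          "geetest_seccode": seccode})
print(resp.status_code)
```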

To sum up, this article only offers a different way of thinking about captchas, one that exploits small oversights made during website development, and for the sliding captcha it only analyzes the verification scheme of offline mode. Don't expect every captcha to be bypassed: without the Ardennes, would Germany not have attacked France in World War II? As long as you judge the data worth it, even if you must face the captcha head-on, my advice as a crawler engineer is: don't hesitate, just do it!

Recommended reading

  • Why can't many websites be crawled? Six common ways crawlers break through blocks

  • A must-have tool stack for startup teams

  • Pain points of machine learning platforms and ways to improve them: an introduction to Dianrong.com's Spark-based machine learning platform for risk control

This article was written by Hou Lei and first published on the DianrongMafia WeChat public account (WeChat ID: DianrongMafia), which authorized High Availability Architecture to republish it. Original articles on technology and architecture practice are welcome; submit via the public account menu "Contact us".
