Scene:

“Explain yourself: how much data did you scrape in total, which websites did you scrape it from, and what was it used for? It looks like enough to keep you inside for a few years…” the policeman said solemnly to Xiao Zhang.

Xiao Zhang is a programmer who works as a crawler engineer at a big-data credit company. There is a running joke in the trade: the better you learn crawling, the sooner you land in prison!

Introduction:

Of course, the scene above is only a joke. In Internet companies, crawler engineers are a dime a dozen. Because the demand for data is real, the Internet industry accepts that crawlers exist.

Viewed dialectically, some crawlers are malicious and do harm; others are benign and bring you benefits. We should therefore step back, see the essence behind the phenomenon, and grasp the first principles.

Little Pigeon has thought hard about three questions, so that more people can learn where crawlers come from and how they have evolved.

(1) What is a crawler and why is it used?

(2) Is it really illegal to be a crawler engineer?

(3) How to solve the common technical difficulties of crawlers?


(1) What is a crawler and why is it used?

1) What is a crawler

To put it simply, a crawler is a probing machine whose basic operation is to simulate human behavior: it wanders around websites, clicks buttons, looks up data, and carries the information it sees back home. It is like a bug crawling tirelessly around a building.

You can think of each crawler as your clone. It is like Sun Wukong plucking out a handful of hair and blowing it into a troop of monkeys.

When you use Baidu every day, you are in fact using crawler technology: every day it releases countless crawlers to every website, fetches their information, then sorts and ranks it, waiting for you to search. Likewise, the major OTAs (online travel agencies) crawl flight information from the official websites of the airlines and consolidate domestic and international fares, so that you can compare prices and pick the flight that suits you best. Ticket-snatching software is another example: it amounts to sending out countless clones, each of which keeps refreshing the train tickets on 12306.cn for you. As soon as one finds a ticket, it places the order and shouts to you: come and pay, moneybags!
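The "wander around, look, carry the information back" loop described above can be sketched in a few lines of Python. This is only a minimal illustration: the names `LinkParser`, `crawl` and `FAKE_SITE` are made up for this sketch, and the "website" is an in-memory dictionary rather than real HTTP requests, so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: visit a page, note its links, visit those next.

    `fetch` is any callable mapping a URL to HTML text, or None when the
    page is unavailable (an assumption of this sketch).
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A hypothetical three-page site held in memory instead of on the network.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

pages = crawl("http://example.test/", FAKE_SITE.get)
print(pages)
```

A real crawler would swap `FAKE_SITE.get` for a function that performs an HTTP request; the queue-and-seen-set structure stays the same.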

2) Crawlers can be good or evil

A search-engine crawler like Baidu's sweeps the entire web every few days so that everyone can look things up, and most of the sites it sweeps are quite happy about it. This is defined as a “benign crawler”. But crawlers like ticket-snatching software hammer 12306 tens of thousands of times per second, and China Railway (Tie Zong) is not happy about it at all. This is defined as a “malicious crawler”. (Note: it does not matter that snatching tickets makes you happy. If the site being crawled is unhappy because the crawler adds load to its servers, the crawler is malicious.)

The chart shows the share of crawler traffic in each industry. Behind each colored block there is a real and powerful chain of interests.

Next, Little Pigeon will tell you about the interests that explain why crawlers exist.
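One common way a crawler stays on the benign side of this line is to obey the site's robots.txt and pause between requests. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `demo-bot` agent name and the `plan_visits` helper are all made up for illustration, and the rules are parsed from a string so nothing touches the network.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from a string so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def plan_visits(urls, agent="demo-bot", default_delay=1.0):
    """Return (url, seconds_to_wait_before_fetching) for each permitted URL."""
    delay = rules.crawl_delay(agent)
    if delay is None:  # the site did not ask for a specific delay
        delay = default_delay
    return [(url, delay) for url in urls if rules.can_fetch(agent, url)]


plan = plan_visits([
    "http://example.test/index.html",
    "http://example.test/private/orders",
])
print(plan)
```

The disallowed URL is dropped from the plan, and the remaining one carries the delay the site asked for. A crawler that ignores both signals and fires tens of thousands of requests per second is the "malicious" kind described above.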

First comes travel, which has the highest share of crawlers of any industry (20.87%). Within travel, 89 percent of the traffic is aimed at 12306. That is no surprise, since it is the only place in China that sells train tickets. Remember the “most devious image captcha in history” that 12306 rolled out, the one with Wang Luodan and Bai Baihe?

These captchas were not meant to make life hard for honest ticket buyers; they were meant to block clicks from crawlers. As mentioned above, a crawler can only perform simple mechanical clicks and does not know who Bai Baihe is. So how does ticket-snatching software recognize such a difficult captcha?

There is something called a “captcha-solving platform”; a quick Baidu search will tell you all about it.

A captcha-solving platform employs many uncles and aunties who sit in front of computer screens and do nothing but identify captchas. Whenever the ticket-snatching software meets a captcha, the system automatically pushes it to one of these uncles or aunties, who picks out by hand which picture is Bai Baihe and which is Wang Luodan, and then sends the result back. The whole process takes no more than a few seconds. Such a platform also has a memory function: once an uncle or auntie has labeled a picture as “rice cooker”, the system will directly judge it to be “rice cooker” the next time that picture appears. Over time, the entire 12306 image library was labeled, and the machine could recognize the pictures by itself, with no need for uncles and aunties.

This is one of the interest chains. A captcha-solving platform is a third party; a typical example is Ruokuai. To get at the train ticket data on 12306, the ticket-snatching software connects to Ruokuai and pays it 0.2 yuan for each captcha identified. The uncles and aunties register Ruokuai accounts and receive the captchas pushed to them, and Ruokuai pays them 0.1 yuan per captcha. A huge interest chain is thus formed. Ruokuai decides whom to push captchas to based on each worker's accuracy, speed, volume and so on, and some uncles and aunties do well enough that 600 a day is no problem.

You may ask how ticket-snatching software makes money. There are many ways to cash in. Here is one: as a ticket buyer, you pay to reserve train tickets for a certain time period; once a ticket turns up, the software grabs it immediately. Besides charging you the reservation fee, it can also bundle in extras such as insurance.
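The "memory function" described above is essentially a lookup table: ask a human once, then reuse the answer. The sketch below is a toy illustration, not how any real platform is known to work. The `LabelMemory` class and the `fake_human` stand-in are hypothetical, and the sketch assumes a repeated picture is byte-for-byte identical, which lets a plain hash serve as the key.

```python
import hashlib


class LabelMemory:
    """Remembers the label a human gave to each picture it has seen."""

    def __init__(self, ask_human):
        self.ask_human = ask_human  # callable: image bytes -> label
        self.known = {}             # picture fingerprint -> label
        self.human_calls = 0

    def label(self, image_bytes):
        # Assumption: identical pictures have identical bytes.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self.known:
            self.human_calls += 1
            self.known[key] = self.ask_human(image_bytes)
        return self.known[key]


def fake_human(image_bytes):
    """Stand-in for a human worker; always answers 'rice cooker'."""
    return "rice cooker"


memory = LabelMemory(fake_human)
first = memory.label(b"picture-001")   # a human is asked
second = memory.label(b"picture-001")  # answered from memory
print(first, second, memory.human_calls)
```

Once every picture in a finite library has been labeled this way, `human_calls` stops growing, which is the point the paragraph makes about the library eventually being fully marked.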

Why is 12306 so stingy, you ask? Would it die if it let crawlers crawl freely? Answer: yes, it would. Do you know what 12306 looks like every year before Chinese New Year? Here is what the public data says: “At its peak, 81.34 billion page views were recorded in one day, and 5.93 billion clicks were recorded in one hour, averaging 1.648 million per second.” And that is with captcha protection in place; you can imagine how many more crawlers there would be without it. This is why, every Spring Festival, when you go to 12306 to grab a ticket, the ticket is still available one second and gone the next. It is also why China Railway invited Alibaba to be its technical consultant. China Railway's brother in misfortune is aviation. The four major domestic airlines (Air China, China Eastern, China Southern and Hainan Airlines) all suffer from crawlers, but the worst hit is the international carrier AirAsia.

Many of you may not have flown AirAsia. It is a Malaysian budget carrier that mostly flies from China to Southeast Asian tourist destinations, a staple of budget travel where even bottled water costs extra.

Why do crawlers like AirAsia so much? Because it is cheap, or rather, because it often releases cheap tickets. AirAsia's original intention was to release cheap tickets at random to attract tourists, but scalpers found a way to profit from it.
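As a quick sanity check on the quoted figures, the per-second rates can be recomputed from the hourly and daily totals. The variable names below are just for this sketch.

```python
clicks_in_peak_hour = 5.93e9    # quoted: 5.93 billion clicks in one hour
page_views_in_day = 81.34e9     # quoted: 81.34 billion page views in one day

peak_per_second = clicks_in_peak_hour / 3600          # seconds in an hour
day_avg_per_second = page_views_in_day / (24 * 3600)  # seconds in a day

print(round(peak_per_second))     # about 1.647 million per second
print(round(day_avg_per_second))  # about 0.94 million per second
```

The peak-hour figure works out to roughly 1.647 million per second, which agrees with the quoted 1.648 million once you allow for rounding in the hourly total.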

Tech-savvy scalpers use crawlers to keep refreshing AirAsia's ticketing interface and snap up cheap tickets the moment they appear. AirAsia has a policy that if an order is not paid within half an hour (I cannot remember the exact time), the ticket automatically returns to the pool and goes back on sale. But the scalper writes that exact time into the crawler script, and one millisecond after the half hour is up, the script grabs the ticket again, and so on. When someone finally orders the ticket from the scalper, the scalper uses the program to release the ticket in the AirAsia system and, 0.0001 seconds later, books it again in your name.

“I'm the middleman, and I'm here to pocket the difference!” A flawless maneuver.

Of course, some airlines are happy to be crawled, especially little-known small airlines whose websites have little traffic and low visibility; being crawled can actually raise their profile. Such cases need to be looked at dialectically.

Little Pigeon has also crawled the ticket price data and check-in data of major domestic and international airlines. If an OTA has a good relationship with an airline, it can connect directly to the airline's interface; alternatively, it can crawl the official data through various channels, including official accounts, mini programs, apps, the PC website and web pages.

Any corner where there is money to be made is likely to have the shadow of a crawler.
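Why the hold-and-release policy is so easy to abuse becomes clearer in a small simulation of the seat pool. Everything here is hypothetical: the `SeatPool` class is not AirAsia's system, the 30-minute `HOLD_SECONDS` follows the article's own uncertain recollection, and time is passed in as plain numbers instead of using a real clock.

```python
HOLD_SECONDS = 30 * 60  # assumed hold window; the article is unsure of the exact value


class SeatPool:
    """Toy model: an unpaid hold returns its seat to the pool after a timeout."""

    def __init__(self, seats):
        self.free = seats
        self.hold_expiries = []  # expiry time (in seconds) of each unpaid hold

    def _release_expired(self, now):
        still_held = [t for t in self.hold_expiries if t > now]
        self.free += len(self.hold_expiries) - len(still_held)
        self.hold_expiries = still_held

    def hold(self, now):
        """Try to reserve one seat without paying; True on success."""
        self._release_expired(now)
        if self.free == 0:
            return False
        self.free -= 1
        self.hold_expiries.append(now + HOLD_SECONDS)
        return True

    def available(self, now):
        self._release_expired(now)
        return self.free


pool = SeatPool(seats=1)
pool.hold(now=0)                     # the script holds the only cheap seat
print(pool.available(now=1799))      # still held just before expiry
pool.hold(now=1800)                  # the script re-holds the instant it frees up
print(pool.available(now=1800.001))  # an ordinary customer still sees nothing
```

In this model the seat is technically released at every expiry, yet an ordinary customer polling by hand never sees it, because the script is always first in line at the exact moment of release.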

Conclusion:

Crawlers chase profit, and they will always go where the profit is. And the places where crawlers find profit are often sore spots we would rather not mention. Some say technology is guilty; some say it is innocent. Guilty or not, we need to use crawlers wisely. Complaining about the world is useless; the right way is to act with our own hands and make sensible use of technical means to create a better world.


Note: there are three lectures on this topic. This lecture is only the first one.