Building web crawlers is a worthwhile pursuit. First, it can be a specialized profession: at the corporate level, business and strategy teams often need large amounts of data for multidimensional analysis, so many companies now employ dedicated crawler engineers to design data acquisition systems. Second, many businesses, such as search engines and the recently popular Jike app, rely on crawlers to make a profit. Third, crawling can be a good way for programmers to earn extra money in their spare time. At the very least, it is a fun toy for grabbing interesting pictures and articles and building a side project you enjoy.

I first came across crawlers by reading articles on the “Quiet Find” blog. Its author recently published Python 3 Web Crawler Development, one of the most systematic crawler books on the market. Here I want to summarize the main problems I ran into while writing crawlers, in the hope that it helps readers who have not touched the topic yet, and that more experienced developers can point out where my approach is off. This article focuses on clarifying how a crawler works rather than on specific code.

A web crawler, also known as a web spider, is essentially an automated robot that gathers information on the Internet on behalf of a human. So if we retrace the steps a user takes to obtain information from the network, the whole picture of how a crawler operates becomes clear.

URL management

When I surf the web, I first type in a URL and the server returns a page. Whenever I see an article that interests me, I drag it with the mouse and the browser opens it in a new tab; a minute later I have everything I want from the home page open. There is another option: when I come across an interesting article I click into it, inside that article I find something even more interesting and click again, and only later do I go back to the home page for the second article that caught my eye.

The first strategy is called “breadth first” and the second “depth first”. In practice, breadth first is generally adopted: we take the URLs we are interested in from the page returned by the entry point and put them into a to-do list; after each URL is crawled, we move it to a completed list. Anything abnormal gets an extra mark for later handling.

In fact, the simplest crawler does only one thing: visit addresses and retrieve data. When the number of addresses to visit grows large, set up a URL manager that tracks the state of every URL to be processed. When the logic is simple you can use data structures such as arrays; when it is complex, use a database for storage. One advantage of database records is that when the program unexpectedly crashes, you can resume from the ID you were working on instead of starting over and re-crawling URLs you have already handled. Using Python 3 as an example, the pseudocode looks like this:

def main():
    root_url = 'https://www.cnblogs.com'
    res = get_content(root_url)              # fetch the entry page
    first_floor_urls = get_wanted_urls(res)  # extract the URLs we care about

    for url in first_floor_urls:
        res_url = get_content(url)
        if sth_wrong(res_url):
            put_to_error_list(url)           # mark failures for later handling
        else:
            second_floor_urls = get_wanted_urls(res_url)
    # rest of the code
            
if __name__ == '__main__':
    main()
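To make the crash-recovery idea above concrete, here is a minimal sketch of a database-backed URL manager using SQLite from Python's standard library. The class, table and column names are my own, not from any particular project:

import sqlite3

class UrlManager:
    """Track URLs as 'todo', 'done' or 'error' so a crashed run can resume."""

    def __init__(self, path='crawler.db'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls "
            "(url TEXT PRIMARY KEY, status TEXT DEFAULT 'todo')")

    def add(self, url):
        # ignore URLs we have already recorded
        self.conn.execute('INSERT OR IGNORE INTO urls (url) VALUES (?)', (url,))
        self.conn.commit()

    def next_todo(self):
        row = self.conn.execute(
            "SELECT url FROM urls WHERE status = 'todo' LIMIT 1").fetchone()
        return row[0] if row else None

    def mark(self, url, status):
        self.conn.execute('UPDATE urls SET status = ? WHERE url = ?', (status, url))
        self.conn.commit()

After a crash you simply restart the program and keep calling next_todo() until it returns None; everything already marked 'done' is skipped automatically.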

What languages can be used for crawlers

Although I don’t know many languages, I believe any language with a standard library for network access can easily do this. When I first got into crawlers I agonized over whether to use Python, but now I think that was unnecessary: whether it is Java, PHP or some lower-level language, a crawler is easy to implement. Static languages may be less error-prone, lower-level languages may run faster, and Python’s advantage is its richer libraries and more mature frameworks, but it still takes time for a beginner to become familiar with those libraries and frameworks.

For example, when I first tried Scrapy, I spent two days wrestling with the environment and was completely lost in its complex architecture, so I decisively gave up. Since then I have built every crawler with only a few simple libraries; it costs more time, but I gained a much deeper understanding of the whole HTTP process. My view is:

Blindly learning frameworks without understanding the benefits of design is a hindrance to technological progress.

In the early days of my Python career, I spent a lot of time in the community reading comparisons of Flask, Django, Tornado and even Sanic. Many of these articles were well written and I learned a lot from them, such as that Flask is more flexible while Django is bigger and more all-inclusive, and so on.

But honestly, it wasted a lot of my time. Newcomers tend to spend a great deal of energy searching for a one-size-fits-all method, language or framework, believing that once found it will keep them safe through every challenge. If I had to do it over again, I would read one or two of those excellent comparison articles, boldly pick one mainstream framework, try the others in unimportant practice projects, and discover their strengths and weaknesses after using them a few times.

Now I see this tendency not only among newcomers but also among veterans. They see the media hyping Go and assembly, hear everyone talking about microservices and React Native, and follow along without knowing why. But there are also people who genuinely understand the advantages of these technologies, try them out in the right context, and then apply them to their main business step by step. I really admire the people who lead the way instead of being pushed along by every new technology.

Parsing data

This part is conventionally called parsing web pages, but since most data now comes from the mobile side, parsing data is the more appropriate name. Parsing data means: when I visit an address and the server returns content, how do I extract the data I need? When the server returns HTML, I need to pull out the content under a specific div; when it returns XML, I likewise need the content under a particular tag.

The most primitive approach is to use regular expressions. This is a general-purpose technique and most languages have libraries for it; in Python the counterpart is the re module. But regular expressions are hard to read and write, so I avoid them unless I have no other choice. In Python, BeautifulSoup and Requests-HTML are great for extracting content by tag.
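As a small illustration (my own example, not from the original article), here is how the two approaches compare on a made-up HTML snippet, assuming BeautifulSoup is installed via pip install beautifulsoup4:

import re
from bs4 import BeautifulSoup

html = '<div class="post"><h2>Hello</h2><a href="/p/1">read more</a></div>'

# Regular expression: works, but fragile and hard to read
titles = re.findall(r'<h2>(.*?)</h2>', html)

# BeautifulSoup: select by tag and attribute instead
soup = BeautifulSoup(html, 'html.parser')
title = soup.select_one('div.post h2').get_text()
links = [a['href'] for a in soup.select('div.post a')]

print(titles, title, links)   # ['Hello'] Hello ['/p/1']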

Dealing with anti-crawler strategies

Crawlers place a huge resource load on servers. Imagine renting a $30-a-month virtual server from a cloud provider and setting up a small blog to share your technical articles. Your articles are good, so many people want to read them and the server already responds slowly. Then someone points a crawler at your blog and, to catch every update, it hammers the site hundreds of times per second; at that point nobody can get any content from your blog at all.

That is when you have to find a way to contain the crawlers. The server has many strategies for doing so. Every HTTP request carries a lot of parameters, and the server can judge from them whether the request comes from a malicious crawler.

For example, the Cookie value may be wrong, or the Referer and User-Agent may not be what the server expects. We can experiment in the browser to see which values the server accepts, and then modify the request headers in code so that the request looks like normal access.
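A minimal sketch with the requests library; the header values below are placeholders you would copy from a normal browser visit (for example from the Network panel of the developer tools):

import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36'),
    'Referer': 'https://www.example.com/',
    'Cookie': 'sessionid=xxxxxxxx',   # placeholder value
}

resp = requests.get('https://www.example.com/articles', headers=headers, timeout=10)
print(resp.status_code, len(resp.text))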

Besides the fixed request-header parameters, the server may also define custom parameters to validate access, which is especially common on the app side. For instance, it may require you to generate a key from a set of parameters such as a timestamp and send it along, then verify on its side that the key is valid. In that case you need to study how the key is generated, or, if you cannot crack it, fall back to fully impersonating a user with a simulated browser or an emulator.
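Purely as a hypothetical illustration of what such signing often looks like (the parameter names and the hash-over-sorted-parameters scheme are assumptions of mine, not any real app's algorithm):

import hashlib
import time

def sign(params, secret='app_secret_placeholder'):
    # Hypothetical scheme: sort the parameters, join them with a secret,
    # and hash the result. Real apps use their own variations.
    payload = '&'.join(f'{k}={params[k]}' for k in sorted(params)) + secret
    return hashlib.md5(payload.encode('utf-8')).hexdigest()

params = {'user_id': '42', 'ts': str(int(time.time()))}
params['key'] = sign(params)   # the server recomputes and compares this value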

The server can also restrict by IP address, for example by limiting how fast each IP may access it. If I visit from a machine whose IP is 45.46.87.89 and the server decides I am a crawler, it can put that IP on a blacklist immediately, and from then on my requests will all fail. Most IP restrictions are less drastic than that; limiting access speed is more common, such as allowing each IP at most 40 requests per hour.

This requires the crawler designer to pay attention to two things:

  • Cherish server resources and do not hit the server too aggressively
  • Pay attention to the design of the IP proxy pool

Designing an excessively fast access speed is unethical and should not be encouraged. After being hammered by a crawler, a server may react quickly and make its anti-crawler strategy even stricter. So I never make my crawlers too fast, and sometimes I delay the next request by a minute. I have always believed that content others let you access for free deserves to be treated with care.

Also, don’t forget to hide your real IP to protect yourself when designing a crawler. With an IP proxy pool, the IP address changes on every request to avoid being blocked by the server. There are many free proxy lists on the web that you can crawl and store for later use, and there are ready-made projects such as proxy_pool that are very handy: once installed, you request a local address to get an available IP.
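To illustrate both points, here is a rough sketch of my own: fetch pages through a randomly chosen proxy with a polite random delay between requests. The proxy addresses are placeholders; the proxies argument itself is a standard feature of the requests library:

import random
import time
import requests

# Placeholder proxies; in practice they would come from a proxy pool.
PROXIES = ['http://1.2.3.4:8080', 'http://5.6.7.8:3128']

def polite_get(url):
    proxy = random.choice(PROXIES)
    time.sleep(random.uniform(2, 5))   # be gentle with the target server
    return requests.get(url,
                        proxies={'http': proxy, 'https': proxy},
                        timeout=10)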

The battle between crawlers and anti-crawler measures never ends, and other obstacles come up as well, such as captchas. Different captchas call for different treatment; common strategies include paid recognition services and image recognition.

Other specific problems can be analyzed with a packet capture tool. Charles and Fiddler are the most commonly used ones, and they are easy to use. On the command line I use mitmproxy, which has a fancy name derived from “man-in-the-middle attack”. I also tried Wireshark, which is much more complex but lets nothing in the whole access process slip by. If you have the energy, the books How Networks Connect and Wireshark Network Analysis Is That Simple are worth reading to understand network access.
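As a small taste of command-line capture, this is a minimal mitmproxy addon sketch (my own example) that prints every request passing through the proxy; it would be run with mitmdump -s log_urls.py and the browser or phone pointed at the proxy:

# log_urls.py
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    # called once for every intercepted request
    print(flow.request.method, flow.request.pretty_url)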

Packet capture tools are useful not only for analyzing crawlers but also for practicing network attack and defense. I once used Fiddler to find quite a few bugs in a mainstream fitness app. They noticed soon afterwards and told me that bugs submitted through their official channel would be rewarded. I did submit them, but never received any reward or even a reply. So not every big company is reliable.

Simulators

Crawler design also has to face a rather brutal reality: the Web is becoming more and more JavaScript-heavy, and key verification on the mobile side is getting so complex that it cannot be cracked. Then the only option left is a simulator that fully impersonates a user.

On the Web side, the common tool for simulating a browser is Selenium, an automated testing tool that drives a browser to click, drag and perform other operations in place of a human; it is often paired with PhantomJS.

PhantomJS is a WebKit-based headless browser scriptable through a JavaScript API and released under the BSD license. It can load and render Web pages without a visible browser window and natively supports Web standards such as DOM handling, JavaScript, CSS selectors, JSON, Canvas and SVG. However, its maintenance appears to have been suspended.

That is fine, because Selenium also works with browsers such as Firefox and Chrome; you just configure the corresponding driver when you need it.
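A minimal sketch of driving headless Chrome with Selenium (assuming Selenium 4 with a matching ChromeDriver available; the URL and the selector are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # run Chrome without a window

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.example.com')   # JavaScript on the page runs normally
    for link in driver.find_elements(By.CSS_SELECTOR, 'h2 a'):
        print(link.text, link.get_attribute('href'))
finally:
    driver.quit()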

Beyond the Web, mobile apps can also be driven with simulator techniques that fully mimic human actions. I have used UIAutomator; Appium is said to be even more powerful, but I have not used it yet.

When we need concurrency and do not have enough real devices to crawl with, we can use an emulator such as Genymotion; using it is much like using a Linux virtual machine: just download the installer and configure it.

Concurrency and distribution of crawlers

Python has no particular advantage for highly concurrent crawling, but as mentioned earlier, a highly concurrent crawler puts too much pressure on someone else's server, and a considerate person would not run one unrestrained anyway, so raw concurrency in the language matters less than it seems. That said, since Python 3.6 the async/await syntax together with aiohttp is very pleasant to use.
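Here is a sketch of a small concurrent fetcher with aiohttp, capped by a semaphore so it stays polite; the URLs and the limit of 5 are placeholders:

import asyncio
import aiohttp

URLS = [f'https://www.example.com/page/{i}' for i in range(1, 11)]   # placeholders

async def fetch(session, sem, url):
    async with sem:                          # cap the requests in flight
        async with session.get(url) as resp:
            body = await resp.text()
            return url, resp.status, len(body)

async def main():
    sem = asyncio.Semaphore(5)               # at most 5 concurrent requests
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        results = await asyncio.gather(*(fetch(session, sem, u) for u in URLS))
        for url, status, size in results:
            print(status, size, url)

asyncio.run(main())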

As for distribution, I have not studied it; most of my crawlers never reach that scale. I have worked with distributed storage, though, and setting up a MongoDB cluster is not too hard.

Conclusion

Crawlers are simple to describe, but as so often, doing a simple thing well takes real effort. The main things needed to build a good crawler, as far as I can see, are:

  • URL management and scheduling. A well-thought-out design tolerates failures, so a single crawl error causes minimal damage.
  • Data parsing. Learning regular expressions is always worthwhile.
  • Dealing with anti-crawler strategies. This requires some understanding of HTTP, preferably learned systematically.
  • Simulators. They are somewhat inefficient, and the machine can do little else while they run.

I really enjoy designing crawlers, and in the future I would like to try building a general-purpose one. This article keeps concrete code to a minimum, because the open-source crawlers you can find online are already easy to read and I see no need to duplicate them. I collected a few of them (all in Python) while learning; if you are interested, just ask me for them.