“Getting started” is a good motivation, but it can be slow to pay off. If you have a project in hand or in mind, practice will be goal-driven, rather than plodding through material module by module. Practice leads to true knowledge, and learning without use is not only inefficient but also quickly forgotten. Use what you learn.

In addition, if every point in the knowledge system is a node in a graph and the dependencies are edges, then that graph is certainly not a directed acyclic graph, because the experience of learning A can help you learn B, and vice versa. Therefore, you do not need to learn how to “get started”, because there is no such “getting started” node! What you need to learn is how to build something sizable, and in the process you will pick up what you need very quickly. Of course, you could argue that you need to know Python first; how else would you write a crawler in Python? In fact, you can learn Python while building the crawler. Seeing how many of the earlier answers talked about the “technique” (which software to use for crawling), let me talk about both the principle and the technique: how a crawler works and how to implement one in Python.

The short version first. You need to learn:

1. The basics of how a crawler works
2. Basic HTTP fetching tools: scrapy
3. Bloom Filter: Bloom Filters by Example
4. If you need large-scale web crawling, you need to learn the concept of distributed crawlers. It is not that mysterious; you just need to learn how to maintain a distributed queue that all the machines in the cluster can share effectively. The simplest implementation is python-rq: github.com/nvie/rq (a short sketch follows this list)
5. The combination of rq and Scrapy: darkrho/scrapy-redis · GitHub
6. Follow-up processing: web page extraction (grangier/python-goose · GitHub), storage (Mongodb)
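Item 4 is easier to picture with a tiny example. The sketch below is not from the original answer; it assumes a single Redis server that every machine in the cluster can reach, and the host name, queue name, and fetch_page stub are invented for illustration.

    # A minimal sketch of a shared work queue with python-rq (host name is made up).
    from redis import Redis
    from rq import Queue

    def fetch_page(url):
        """Download and store one page (stub); workers must be able to import this."""
        print("fetching", url)

    # Every machine connects to the same Redis instance, so they all share one queue.
    q = Queue("crawl", connection=Redis(host="redis.example.com"))
    q.enqueue(fetch_page, "http://www.people.com.cn/")
    # On each worker machine, a process started with `rq worker crawl` pulls jobs
    # from this queue and runs fetch_page on them.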

Now for the long version: the story of how I once wrote a cluster that crawled the whole of Douban.

1) First you have to understand how a crawler works. Imagine you are a spider, and you have just been dropped onto the Internet. You need to look at all the web pages. What do you do? No problem, just start somewhere, for example the front page of the People’s Daily.

On the front page of the People’s Daily, you can see the various links that the page leads to. So you happily crawl over to the “National News” page. Great, you’ve already crawled two pages (the front page and National News)! For now, never mind what you do with the pages you crawl down; just imagine that you copy each page out as a complete piece of HTML and carry it on your body.

Suddenly, on the National News page, you notice a link back to the front page. As a smart spider, you know you don’t have to crawl back, because you’ve already seen that page. So you need to use your brain to store the addresses of the pages you’ve already looked at. That way, every time you see a new link you might need to crawl, you first check whether you already have that address in your head. If you do, skip it.

Well, in theory, if every page is reachable from the initial page, then you can be sure you will eventually crawl them all.

So how do you do that in Python? It’s quite simple:

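The code block that originally followed here seems to have been lost in this copy, so below is only a minimal sketch of the loop just described: a queue of pages still to visit plus a “seen” set of addresses already encountered. The helpers store() and extract_urls() are hypothetical stubs standing in for “save the page” and “pull out its links”.

    from collections import deque

    initial_page = "http://www.people.com.cn/"  # start anywhere, e.g. the People's Daily

    def store(url):
        """Download the page behind `url` and save its HTML somewhere (stub)."""

    def extract_urls(url):
        """Return all the links found on the page behind `url` (stub)."""
        return []

    url_queue = deque([initial_page])  # pages we still have to visit
    seen = {initial_page}              # pages we have already queued or visited

    while url_queue:
        current_url = url_queue.popleft()   # take the next page off the queue
        store(current_url)                  # keep a copy of it
        for next_url in extract_urls(current_url):
            if next_url not in seen:        # only queue pages we have never met
                seen.add(next_url)
                url_queue.append(next_url)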

That’s already written pretty much as pseudocode.

The backbone of every crawler is right there. Now let’s analyze why a crawler is in fact a rather complicated thing: search engine companies usually have an entire team to maintain and develop theirs.

2) Efficiency

If you tidied up the code above a little and ran it directly, it would take you a whole year to crawl down the entire contents of Douban, never mind that a search engine like Google needs to crawl down the entire web.

What’s the problem? There are so many pages to crawl, and the code above is too slow. Suppose the whole network has N websites; then the de-duplication cost is N*log(N), because every page has to be traversed once, and each membership check against the set costs log(N). OK, OK, I know Python’s set is implemented as a hash table, but it’s still too slow, or at least not memory-efficient.

What is the usual way to handle de-duplication? A Bloom Filter. Simply put, it is still a hashing approach, but its characteristic is that it uses a fixed amount of memory (which does not grow with the number of URLs) to determine, with O(1) efficiency, whether a URL is already in the set. The only problem is that if a URL is not in the set, the BF is 100% sure the URL has not been seen; but if the URL is in the set, it tells you: this URL should have appeared already, though I have 2% uncertainty. Note that this uncertainty can become very, very small when you allocate enough memory. A simple tutorial: Bloom Filters by Example
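To make the “fixed memory, O(1) check” claim concrete, here is a toy Bloom filter, not the production-grade kind covered in the tutorial above: k salted hashes set k bits in a fixed-size bit array, and “probably seen” means all k bits are set. The sizes below are arbitrary illustration values.

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: fixed memory, O(1) add/lookup, small false-positive rate."""

        def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)  # fixed size, independent of URL count

        def _positions(self, url):
            # derive k bit positions from k salted md5 digests
            for i in range(self.num_hashes):
                digest = hashlib.md5(f"{i}:{url}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, url):
            for pos in self._positions(url):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, url):
            # False -> definitely never seen; True -> almost certainly seen
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

    bf = BloomFilter()
    bf.add("http://www.people.com.cn/")
    print("http://www.people.com.cn/" in bf)     # True
    print("http://example.com/brand-new" in bf)  # almost certainly False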

Notice the behaviour this gives you: if a URL has already been looked at, there is a small chance it will be looked at again (that doesn’t matter, a few extra looks won’t kill you). But if it hasn’t been looked at, it will definitely be visited (this is important, otherwise we would miss some pages!). [IMPORTANT: there is a problem with this paragraph, please skip it for now]

All right, now we are close to the fastest way of handling de-duplication. The next bottleneck: you only have one machine. It doesn’t matter how much bandwidth you have; as long as the speed at which your machine downloads web pages is the bottleneck, you have to speed that up. If one machine isn’t enough, use many! Of course, we assume each machine is already being used at maximum efficiency, for example with multiple threads (in Python, multiple processes).
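As a rough sketch of the “one machine, many processes” idea, a process pool keeps several downloads in flight at once; the pool size and URL list below are placeholders.

    from multiprocessing import Pool
    from urllib.request import urlopen

    def download(url):
        """Fetch one page and return (url, html); errors are swallowed for brevity."""
        try:
            return url, urlopen(url, timeout=10).read()
        except OSError:
            return url, b""

    if __name__ == "__main__":
        urls = ["http://www.people.com.cn/", "https://www.douban.com/"]  # placeholder list
        with Pool(processes=8) as pool:  # 8 worker processes downloading in parallel
            for url, html in pool.imap_unordered(download, urls):
                print(url, len(html), "bytes")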

3) Clustered crawling

In total, I used more than 100 machines running around the clock for a month. Imagine doing it with only one machine: you’d have to run it for 100 months…

So, assuming you have 100 machines available, how do you implement a distributed crawling algorithm in Python?

Let’s call the 99 smaller machines out of the 100 the slaves, and the one larger machine the master. Now go back to the url_queue in the code above: if we can put that queue on the master, then every slave can talk to the master over the network. Each time a slave finishes downloading a page, it asks the master for a new page to grab; and each time a slave grabs a new page, it sends all the links found on that page to the master’s queue. Likewise, the Bloom filter also lives on the master, but now the master only sends the slaves URLs that are determined not to have been visited. The Bloom Filter sits in the master’s memory, while the visited URLs are kept in a Redis instance running on the master, which keeps every operation O(1) (at least amortized O(1); for Redis access efficiency see: LINSERT – Redis).

Now consider how to implement this in Python: install scrapy on every slave, so that each machine becomes a slave capable of grabbing pages, and install Redis and rq on the master to act as the distributed queue.

So the code is written
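The code that originally appeared here also seems to be missing from this copy, so below is only a rough sketch of the split described above: the master owns the shared queue (a Redis list here) and the de-duplication state, while each slave pulls a URL, crawls it, and reports the links it found back to the master. The Redis host, the queue names, and the store()/extract_urls() stubs are assumptions for illustration; a real deployment would use rq/scrapy-redis as mentioned below.

    import json
    from redis import Redis

    REDIS_HOST = "master.example.com"  # assumption: the master's address

    # ---- master side: owns the queue and the de-duplication state ----
    def master_loop():
        r = Redis(host=REDIS_HOST)
        seen = set()  # stand-in for the Bloom filter kept in the master's memory
        seed = "http://www.people.com.cn/"
        seen.add(seed)
        r.lpush("url_queue", seed)
        while True:
            item = r.brpop("new_links", timeout=60)  # links reported back by slaves
            if item is None:
                break
            for url in json.loads(item[1]):
                if url not in seen:  # only never-seen URLs go out to the slaves
                    seen.add(url)
                    r.lpush("url_queue", url)

    # ---- slave side: runs on every worker machine ----
    def store(url):
        """Download and save the page (stub)."""

    def extract_urls(url):
        """Return the links found on the page (stub)."""
        return []

    def slave_loop():
        r = Redis(host=REDIS_HOST)
        while True:
            item = r.brpop("url_queue", timeout=60)  # ask the master for the next page
            if item is None:
                break
            url = item[1].decode()
            store(url)
            r.lpush("new_links", json.dumps(extract_urls(url)))  # report back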

Well, as you can imagine, someone has already written just what you need: darkrho/scrapy-redis · GitHub

4) Outlook and post-processing

Although the word “simple” is used a lot above, actually building a crawler usable at commercial scale is not easy. The code above is fine, with little or no problem, for crawling an entire single site.

But if you need to attach follow-up processing on top of that, such as:

1. Effective storage (how should the database be organized?)
2. Effective de-duplication of pages (we don’t want to crawl both the People’s Daily and the Damin Daily that plagiarized it); a small sketch of points 1 and 2 follows this list
3. Effective information extraction (for example, extracting all the street addresses on a page, such as “Chaoyang District Endeavour Road Zhonghua Road”); a search engine usually doesn’t need to store all the information, e.g. why would I keep the images…
4. Timely updates (predicting how often a page will be updated)
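As a small illustration of points 1 and 2, pages can be stored in MongoDB keyed by a hash of their content, so that a second page with identical text, like the copycat paper above, is silently dropped on insert. The database and collection names, and the local MongoDB address, are invented for the example.

    import hashlib
    from pymongo import MongoClient, errors

    client = MongoClient("mongodb://localhost:27017")  # assumption: a local MongoDB
    pages = client["crawler"]["pages"]
    pages.create_index("content_hash", unique=True)  # identical page bodies collide here

    def save_page(url, html):
        doc = {
            "url": url,
            "html": html,
            "content_hash": hashlib.sha1(html.encode()).hexdigest(),
        }
        try:
            pages.insert_one(doc)
        except errors.DuplicateKeyError:
            # same content already stored under another URL: skip the copycat
            pass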

As you can imagine, every one of these points could keep many researchers busy for decades. Nevertheless, “the road ahead is long; I see no ending, yet high and low I will search.” So don’t ask how to get started; just hit the road.
