
The following article comes from Tencent Cloud (author: the Penguin editorial team).




When I first came into contact with crawlers, it felt amazing: with a dozen lines of code I could grab information from countless web pages, automatically select page elements, and automatically organize them into structured files.

With that data you can do all kinds of industry analysis and market research and extract a great deal of valuable information. It seemed a pity that such a skill wasn’t mine to use, so I decided to start learning it.

1. The start is not the easiest part

At the beginning I didn’t really understand what a crawler was, and with no computer science or programming background I was honestly a bit lost. I had no clear idea of where to start, what to learn first, and what to leave until I had built up some foundation.

Since it’s a Python crawler, Python itself is a must, so I started there. I went through some tutorials and books to learn the basic data structures: lists, dictionaries, tuples, then various functions and control statements (conditionals and loops).

After studying for a while, I realized I still hadn’t touched an actual crawler, that I forgot pure theory very quickly, and that going back to review was far too time-consuming; it was beyond discouraging. The funny thing is that after going over the Python basics, I still hadn’t even installed an IDE I could type code into.

2. Get started

The turning point came after reading a technical article about crawlers. Its clear reasoning and plain language convinced me that this was the kind of crawler I wanted to learn. So I decided to set up an environment and see how a crawler actually works. (Call it impatience if you like, but every beginner wants to do something hands-on with immediate feedback.)

For fear of making mistakes, I installed the relatively safe Anaconda and used the built-in Jupyter Notebook as my IDE. Seeing how many people ran into all kinds of bugs just configuring their environment, I felt almost lucky. A lot of the time, what defeats you isn’t the crawler itself but the environment configuration.

Another problem was that a Python crawler can be built with many different packages or frameworks, so which should I choose? My principle was: simple, easy to use, as little code as possible. For a beginner, performance and efficiency could wait. So I started with urllib and BeautifulSoup, because I had heard they were easy.

My first project was crawling Douban movies. Countless people recommend Douban to beginners because its pages are simple and its anti-crawling measures are not strict. I followed some introductory examples of crawling Douban movies and, from them, understood the basic workflow of a crawler: download the page, parse it, then locate and extract the data.

Of course, I didn’t study urllib and BeautifulSoup systematically. I only needed to solve the problems in the examples at hand: downloading and parsing a page are basically fixed statements that you can use directly without digging into the underlying principles.

Fixed statements for downloading and parsing a page with urllib
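Roughly, those fixed statements look like the sketch below. It assumes Douban’s movie Top250 page and that a browser-like User-Agent is enough to get a response; treat both as assumptions rather than a definitive recipe.

```python
# A minimal sketch of the "fixed statements" for downloading and parsing a page.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://movie.douban.com/top250"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # the default UA is often rejected
html = urlopen(req).read().decode("utf-8")                 # download the page source
soup = BeautifulSoup(html, "html.parser")                  # parse it into a searchable tree
print(soup.title.get_text())
```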

The basic BeautifulSoup methods still matter, but they are little more than find(), get_text() and the like, and there isn’t much to them. By borrowing other people’s ideas and looking up BeautifulSoup usage myself, I finished crawling the basic information of Douban movies.



Crawling Douban movie details with BeautifulSoup
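For reference, the find()/get_text() calls involved might look like this, continuing from the soup object above; the class names ("item", "title", "rating_num", "inq") are assumptions about Douban’s Top250 markup.

```python
# Locate each movie block, then pull out the fields with find() and get_text().
for item in soup.find_all("div", class_="item"):
    title = item.find("span", class_="title").get_text()
    rating = item.find("span", class_="rating_num").get_text()
    quote = item.find("span", class_="inq")           # the one-line blurb, sometimes missing
    print(title, rating, quote.get_text() if quote else "")
```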

3. The crawler gets on the right track

Once you have some routines and patterns, you have a goal and can keep going. Still on Douban, I set myself the task of getting more out of it: more fields, multiple movies, multiple pages. That is when the gaps in my foundation showed up, for example the control flow needed to crawl multiple elements, turn pages, and handle different cases, and the string, list, and dictionary handling needed to extract the content.

Going back to the Python basics at that point was very concrete: each thing I looked up could be applied immediately to a real problem, so it sank in much deeper. By the time I had crawled Douban’s Top250 books and movies, I basically understood the overall workflow of a crawler.

BeautifulSoup is fine, but you need to spend some time understanding how a web page is structured; otherwise locating and selecting elements is a headache.

Later I discovered XPath, which is a basic tool you can pick up right away: you can copy an expression directly from Chrome, and even writing your own XPath takes maybe an hour with a few pages of the W3School XPath tutorial. Requests also turned out to be nicer than urllib, but figuring these things out always involves trial and error, and the cost of trial and error is time.



Requests + XPath for Douban’s Top250 books
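A sketch of what requests plus XPath (via lxml) might look like for the book Top250. The XPath expressions are the kind you can copy straight out of Chrome DevTools, and are assumptions about the page structure.

```python
import requests
from lxml import etree

resp = requests.get("https://book.douban.com/top250",
                    headers={"User-Agent": "Mozilla/5.0"})
tree = etree.HTML(resp.text)                              # build an XPath-queryable tree
titles = tree.xpath('//div[@class="pl2"]/a/@title')       # book titles
ratings = tree.xpath('//span[@class="rating_nums"]/text()')
for title, rating in zip(titles, ratings):
    print(title, rating)
```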

4. Running into anti-crawler trouble

With Requests + XPath I could crawl plenty of websites, and went on to practice on Xiaozhu’s rental listings and Dangdang’s book data. Then some sites simply stopped returning anything to my crawler, which is how I learned that a crawler has to pretend to be a browser, and what the headers of a request actually are.

Adding headers to the crawler so it pretends to be a real user
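In requests, “pretending to be a browser” is just a matter of passing a headers dict copied from the Network panel. The User-Agent string below is only an example; other fields (Referer, Cookie) can be added the same way if the site checks them.

```python
import requests

headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/91.0.4472.124 Safari/537.36"),
}
resp = requests.get("https://movie.douban.com/top250", headers=headers)
print(resp.status_code)  # 200 once the site accepts the "browser"
```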

Then came elements that simply could not be located. That is how I learned about asynchronous loading: the data isn’t in the page’s source code at all, and you have to capture the network traffic to find it. So you preview the various JS and XHR requests in the browser’s developer tools and look for the ones whose responses contain the data.

Once you know the data is loaded this way, things actually get easier: find the JSON file directly and you get the corresponding data. (A Chrome extension worth recommending here is JSONView, which makes JSON files easy for a beginner to read.)



Capturing data loaded by JavaScript in the browser’s developer tools
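Once the XHR request behind the page is found in the Network panel, it can usually be called directly and parsed as JSON. The URL and the "results" field below are hypothetical placeholders, not a real endpoint.

```python
import requests

api_url = "https://example.com/api/movies?page=1"    # placeholder XHR endpoint
resp = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"})
data = resp.json()                                   # parse the JSON body
for row in data.get("results", []):                  # field name is an assumption
    print(row)
```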

That was my introduction to anti-crawling measures. Of course, these are only the most basic ones; stricter IP restrictions, CAPTCHAs, text encryption and so on can bring plenty more difficulties. But isn’t it more efficient to learn by solving the problems in front of you, one at a time?

For example, when crawling other websites later, my IP would get blocked. A simple fix is time.sleep() to control the crawl frequency; when the restrictions are strict or the crawl speed has to be maintained, proxy IPs solve the problem.
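Both workarounds are only a line or two in requests; a minimal sketch, with the proxy address as a placeholder:

```python
import time
import requests

headers = {"User-Agent": "Mozilla/5.0"}
for start in range(0, 250, 25):
    url = f"https://movie.douban.com/top250?start={start}"
    resp = requests.get(url, headers=headers)
    time.sleep(2)                                    # throttle the crawl frequency

proxies = {"http": "http://127.0.0.1:1080", "https": "http://127.0.0.1:1080"}
resp = requests.get(url, headers=headers, proxies=proxies)  # route through a proxy instead
```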

I also tried Selenium, which really does crawl through real browsing behavior (clicking, searching, turning pages), so even sites with particularly aggressive anti-crawling measures have a hard time telling it apart from a user. Although slightly slow, Selenium is a super handy tool.
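A sketch of driving a real browser with Selenium: it assumes Chrome with a matching ChromeDriver on the PATH, and the "q" search-box name is an assumption about Douban’s homepage.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.douban.com/")
box = driver.find_element(By.NAME, "q")   # the site-wide search box
box.send_keys("Python")
box.submit()                              # submit the search like a real user
print(driver.title)                       # title of the results page
driver.quit()
```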

5. Trying the powerful Scrapy framework

With requests + XPath and packet capture you can already do a lot: movies in different categories on Douban, plus 58.com, Zhihu, and Lagou were basically all fine. But when the data to crawl gets large and each module needs flexible handling, it starts to feel overwhelming.

So I learned the powerful Scrapy framework. It not only makes it easy to build Requests, but its powerful Selector also makes parsing Responses easy. Most impressive of all are its very high performance and the way it lets you make a crawler engineered and modular.

The basic components of a Scrapy framework

After learning Scrapy and trying to build a simple crawler project of my own, I could approach large-scale crawling structurally and think about problems from the dimension of crawler engineering.

Of course, Scrapy’s selectors, middlewares, spiders and so on are hard to grasp at first. I recommend working through concrete examples and reading other people’s code to understand how the pieces fit together; see the sketch below.
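As a concrete reference point, a minimal spider might look like this: start_urls feeds the Requests, parse() uses the Selector on each Response, and response.follow() handles pagination. The XPath expressions are again assumptions about Douban’s markup.

```python
# Run with: scrapy runspider douban_spider.py -o movies.json
import scrapy

class DoubanTop250Spider(scrapy.Spider):
    name = "douban_top250"
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        for item in response.xpath('//div[@class="item"]'):
            yield {
                "title": item.xpath('.//span[@class="title"]/text()').get(),
                "rating": item.xpath('.//span[@class="rating_num"]/text()').get(),
            }
        next_page = response.xpath('//span[@class="next"]/a/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```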



Crawling large amounts of rental listings with Scrapy

6. When local files can’t cope, move to a database

After crawling back a large amount of data, I found that saving it in local files was very inconvenient, and even when it was saved, opening a big file would freeze the computer badly. What to do? Go with a database, so I dived into MongoDB. It can store both structured and unstructured data, and with PyMongo installed it’s easy to work with the database from Python.

MongoDB itself can be troublesome to install, and trying it alone you can easily get stuck. My own installation also hit all kinds of bugs at first; fortunately, a friend’s advice solved most of the problems.

Of course, crawling doesn’t require advanced database skills. It is mainly about storing and retrieving data, plus mastering the basic insert, delete and similar operations. In short, being able to pull the crawled data back out efficiently is enough.
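Those basic operations amount to very little code. A minimal sketch, assuming MongoDB is running locally on the default port; the database, collection, and field names are arbitrary examples.

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
jobs = client["crawler"]["lagou_jobs"]        # database / collection names are arbitrary

jobs.insert_one({"title": "Python Engineer", "city": "Shenzhen", "salary": "15k-25k"})
for doc in jobs.find({"city": "Shenzhen"}):   # pull the crawled data back out
    print(doc)
jobs.delete_many({"city": "Shenzhen"})        # basic delete
```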



Crawling Lagou job listings and storing them in MongoDB

7. The legendary distributed crawler

By this point I could crawl a large share of web pages, and the bottleneck had shifted to the efficiency of crawling data at scale. Having learned Scrapy, I naturally came across a very impressive-sounding term: the distributed crawler.

It turns out that besides the Scrapy and MongoDB from the previous sections, you also need to know Redis.

Scrapy handles the basic page crawling, MongoDB stores the crawled data, and Redis stores the queue of pages to be crawled, in other words the task queue.
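One common way to wire this together (not spelled out in the original setup, so treat it as an assumption) is the scrapy-redis extension. A sketch of the settings it adds to an ordinary Scrapy project, with the Redis address as a placeholder:

```python
# settings.py additions for scrapy-redis: the scheduler and duplicate filter
# move into Redis so several crawler machines can share one task queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                    # keep the queue between runs
REDIS_URL = "redis://127.0.0.1:6379"        # shared task queue (placeholder address)

ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,   # optionally buffer items in Redis too
}
```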

“Distributed” sounds scary, but break it down and learn it step by step and that is really all there is to it.



Distributed crawl of 58.com: defining the item fields

Learning crawlers from zero really does have many pitfalls, which can be summed up as follows:

1. Environment configuration: all the packages to install and environment variables to set are very unfriendly to a beginner;

2. No sensible learning path: diving straight into learning all of Python and HTML makes it extremely easy to give up;

3. Python has many packages and frameworks to choose from, and a beginner has no idea which is friendlier;

4. When something goes wrong, you don’t even know how to describe the problem, let alone search for a solution;

5. Material online is scattered and unfriendly to beginners, and much of it reads like fog;

6. Some things seem clear when you read them, but writing the code yourself still turns out to be hard.

……………………………………

So the biggest lesson many people, myself included, take away is: don’t try to chew through everything systematically. Find a practical project (starting with something as simple as Douban) and just start.

Crawling doesn’t require you to master a language systematically, nor does it need sophisticated database skills. Learning the scattered knowledge points through a real project ensures that whatever you learn next is exactly the part you need most.