Look around and you will notice that more and more people know about and are learning web crawlers. On the one hand, ever more data is available on the Internet; on the other, programming languages like Python keep providing excellent tools that make crawlers simple and easy to use.

With crawlers, we can collect large amounts of valuable data and obtain insights that everyday observation cannot provide, for example:

  • Zhihu: crawl the high-quality answers and filter out the best content under each topic.
  • Taobao and JD.com: capture product, review, and sales data to analyze what different users buy and in which scenarios.
  • Anjuke and Lianjia: capture property sale and rental listings to analyze housing price trends and compare prices across regions.
  • Lagou and Zhaopin: crawl job postings to analyze talent demand and salary levels across industries.
  • Xueqiu: capture the behavior of high-return Xueqiu users to analyze and forecast the stock market.

For a complete beginner, a crawler may look complicated, with a high technical threshold. Some people think they must first master Python, so they grind through every Python topic one by one, yet after a long time they still cannot crawl any data. Others think they need to understand web pages first, so they start with HTML and CSS and suddenly find themselves deep in the front-end rabbit hole... In fact, with the right method, being able to crawl data from major websites within a short time is entirely achievable, but it is recommended that you have a concrete goal from the start. When you are goal-driven, your learning will be more focused and efficient, and all the prerequisite knowledge you think you need can be picked up along the way. Here is a smooth, zero-prerequisite, quick-start learning path.

  • Learn the Python packages and implement the basic crawler process
  • Understand the storage of unstructured data
  • Learn Scrapy and build engineered crawlers
  • Learn database basics to deal with large-scale data storage and retrieval
  • Master various techniques to cope with the anti-crawling measures of specific websites
  • Build a distributed crawler to achieve large-scale concurrent collection and improve efficiency

1. Learn the Python packages and implement the basic crawler process. Most crawlers follow the same routine of "send a request, obtain the page, parse the page, extract and store the content", which simply simulates how a browser fetches web page information. Python has many crawler-related packages: urllib, Requests, BeautifulSoup (bs4), Scrapy, PySpider, and so on. It is recommended to start with Requests + XPath: Requests connects to the website and returns the page, and XPath parses the page and extracts the data. If you have ever used BeautifulSoup, you will find XPath much easier; it spares you the work of inspecting element code layer by layer. Once this basic routine is down, ordinary static websites are no problem at all: Douban, Qiushibaike, Tencent News, and the like can all serve as starting points. If you need to crawl asynchronously loaded websites, you can learn to use the browser's developer tools to capture and analyze the real requests, or learn Selenium to automate a browser, so that dynamic sites such as Zhihu, Mtime, and Tripadvisor can be handled as well.
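
As a quick illustration, here is a minimal sketch of that "send request, parse page, extract data" routine using Requests together with lxml's XPath support. The URL, headers, and XPath expression are placeholders for illustration; inspect the page you actually target and adjust them.

```python
# A minimal sketch of the "request -> parse -> extract" routine with Requests + XPath.
import requests
from lxml import etree

url = "https://movie.douban.com/top250"  # example page; any static page works
headers = {"User-Agent": "Mozilla/5.0"}  # present ourselves as a normal browser

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

tree = etree.HTML(response.text)
# The XPath below is illustrative; inspect the page to find the real node paths.
titles = tree.xpath('//div[@class="hd"]/a/span[1]/text()')

for title in titles:
    print(title)
```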

2. Understand the storage of unstructured data

The data you crawl back can be stored locally as files or saved to a database. For small amounts of data you can write CSV files with Python's built-in facilities or with pandas. You may find that the data is not clean: there may be errors, missing values, and duplicates. Learning the basic usage of the pandas package also lets you clean the data before storing it.
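
For example, here is a small sketch of light cleaning and CSV storage with pandas; the field names and values are made up for illustration.

```python
# A small sketch of cleaning scraped records with pandas and saving them to CSV.
import pandas as pd

records = [
    {"title": "Item A", "price": "128", "comment": " great "},
    {"title": "Item B", "price": None, "comment": "ok"},
]

df = pd.DataFrame(records)
df["comment"] = df["comment"].str.strip()   # trim stray whitespace
df["price"] = pd.to_numeric(df["price"])    # convert strings to numbers
df = df.dropna(subset=["price"])            # drop rows missing a price
df.to_csv("items.csv", index=False, encoding="utf-8")
```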

3. Learn Scrapy and build engineered crawlers

At an ordinary scale of data and code, the basic routine above will carry you, but in more complex situations you may still struggle, and that is where the powerful Scrapy framework becomes especially useful. Scrapy is a very capable crawler framework: it makes building requests easy, its powerful selectors parse responses effortlessly, and, best of all, it has extremely high performance and lets you build crawlers in a modular, engineered way. Once you learn Scrapy, you can build your own crawler framework and start to think like a crawler engineer.
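
To give a feel for how little code an engineered crawler takes, here is a minimal Scrapy spider sketch following the framework's usual tutorial pattern; the site (quotes.toscrape.com, a public practice site) and the CSS selectors are illustrative.

```python
# A minimal Scrapy spider sketch. Run inside a Scrapy project with:
#   scrapy crawl quotes -o quotes.json
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # extract one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow pagination; Scrapy schedules the next request for you
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```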

4. Learn database basics and deal with large-scale data storage

When the amount of data you crawl back is small, storing it as files works fine, but once the volume grows, files become hard to manage. You therefore need to master one database; learning the currently mainstream MongoDB or Redis is enough (MySQL is also worth a look). MongoDB makes it easy to store unstructured data, such as comment text and links to images, and the PyMongo library makes it convenient to operate MongoDB from Python. The database knowledge needed here is actually very simple, mainly how to insert data and how to query it back, so you can learn it when you need it.
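
Here is a sketch of what storing and retrieving crawled items with PyMongo might look like; it assumes a MongoDB instance is running locally on the default port, and the database, collection, and field names are arbitrary examples.

```python
# A sketch of saving crawled items into MongoDB with PyMongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["crawler_demo"]          # example database name
collection = db["comments"]          # example collection name

item = {"url": "https://example.com/post/1", "comment": "nice article", "likes": 12}
collection.insert_one(item)

# pull data back out when you need it
for doc in collection.find({"likes": {"$gt": 10}}):
    print(doc["comment"])
```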

5. Master various techniques to cope with anti-crawling measures of specific websites

Of course, the crawling process will bring some moments of despair, such as blocked IPs, strange captchas, User-Agent access restrictions, dynamically loaded content, and so on. Dealing with these anti-crawler measures requires some advanced techniques: controlling access frequency, using a proxy IP pool, capturing packets, OCR processing of captchas, and the like. Between efficient development and anti-crawling, websites usually lean toward the former, which leaves room for crawlers; once you master these techniques, the vast majority of websites will no longer give you trouble.
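
The sketch below shows the two most common countermeasures in plain Requests: controlling access frequency with random delays, and rotating the User-Agent and proxy on each request. The proxy addresses are placeholders standing in for whatever proxy pool you maintain.

```python
# A hedged sketch of basic anti-anti-crawling measures: random delays,
# a rotating User-Agent, and a placeholder proxy pool.
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]  # placeholder pool

def polite_get(url):
    time.sleep(random.uniform(1, 3))            # control access frequency
    proxy = random.choice(PROXIES)              # rotate proxy IP
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate User-Agent
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )

resp = polite_get("https://example.com")
print(resp.status_code)
```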

6. Distributed crawler to achieve large-scale concurrent collection

Crawling basic data will no longer be a problem; your bottleneck will shift to how efficiently you can crawl massive amounts of data. At this point you will naturally come across a very impressive-sounding term: the distributed crawler. Distributed sounds scary, but it really just uses the principle of multi-threading to let multiple crawlers work at the same time. You need to master three tools: Scrapy + MongoDB + Redis. Scrapy handles the basic page crawling, MongoDB stores the crawled data, and Redis manages the queue of pages waiting to be crawled. So some things look intimidating, but once you break them down, that is all they are. Once you can write a distributed crawler, you can try building a basic crawler architecture to achieve more automated data collection.
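
As a sketch of what the setup can look like, the scrapy-redis extension (a third-party package, assumed installed via pip) lets several Scrapy workers share one Redis-backed request queue and de-duplication filter through a few settings; the pipeline class named below is hypothetical.

```python
# settings.py of a Scrapy project, using the scrapy-redis extension
# (assumed installed with `pip install scrapy-redis`). Every machine running
# the same spider points at the same Redis server and shares the work.

SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # request queue lives in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared de-duplication
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = "redis://localhost:6379"                        # every worker connects here

ITEM_PIPELINES = {
    # a hypothetical pipeline that writes items to MongoDB, e.g. built on PyMongo
    "myproject.pipelines.MongoPipeline": 300,
}
```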

As you can see, by following this learning path you can already become a seasoned hand, and it goes quite smoothly. So at the beginning, try not to systematically grind through one subject; instead, find a practical project (start with something simple, such as Douban or Xiaozhu) and dive right in. Crawling does not require you to systematically master a language, nor does it require sophisticated database technology; the effective approach is to learn these scattered knowledge points from real projects, which ensures that each time you learn the part you need most. Of course, the remaining difficulty is that when you hit a specific problem, finding the learning resources for that specific part, and knowing how to filter and select them, is a big challenge many beginners face.