
The following article comes from Tencent Cloud. Author: Yu Liang.

Data is the raw material for decision-making, and high-quality data is expensive. Whoever learns to mine this raw material becomes a pioneer of the Internet era: mastering the source of information puts you a step ahead of everyone else.

In the era of big data, the Internet has become the carrier of vast amounts of information, and mechanical copy-and-paste is no longer practical: it is time-consuming and error-prone. Crawlers have freed people's hands and won their favor with the ability to fetch resources quickly and selectively.

Crawlers are becoming popular not only because they can quickly collect huge amounts of data, but also because approachable languages like Python make them easy to write.

To a beginner, a crawler may look like a complicated, high-threshold technology, but with the right method it is actually quite easy to start crawling data from mainstream websites in a short time. It is suggested, though, that you have a specific goal from the beginning.

When you're goal-driven, your learning will be more focused and efficient, and all the prior knowledge you think you need can be picked up along the way.

For the Python crawler, we have organized a complete learning framework. Screening and sorting out what to learn and where to find resources is a common problem faced by many beginners.

Next, we'll break down this learning framework, take a closer look at each part, and recommend some resources to answer what to learn, how to learn, and where to learn it.

An introduction to crawlers

A crawler is a program or script that automatically captures information from the World Wide Web according to certain rules.

This definition seems rather stiff, so let's try a better way of explaining it:

As users, the way we get web data is: the browser submits a request -> downloads the page source -> parses/renders the page;

The crawler's way is to simulate the browser: send a request -> download the page source -> extract only the useful data -> store it in a database or file.

The difference is that a crawler extracts only the data useful to us from the page source, and it crawls quickly and at a large scale. A minimal sketch of this flow is shown below.
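Here is a minimal sketch of the "send request -> download page -> extract data -> store" flow described above. It assumes the third-party requests library is installed; the target URL, the title extraction, and the output file name are placeholders chosen for illustration, not details from this article.

import requests

url = "https://example.com"                  # placeholder target page
response = requests.get(url, timeout=10)     # simulate the browser's request
response.raise_for_status()                  # stop if the download failed

html = response.text                         # the downloaded page source

# Extract only the useful data -- here, just the page title as an illustration.
if "<title>" in html and "</title>" in html:
    title = html.split("<title>", 1)[1].split("</title>", 1)[0]
else:
    title = ""

# Store the result in a file (a real crawler might write to a database instead).
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n")

A real crawler would repeat this loop over many pages and use a proper parser for extraction, but the basic request -> download -> extract -> store steps stay the same.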

As data grows in scale, the efficiency with which crawlers obtain it becomes more and more prominent, and they can be used for more and more things:

· Market analysis: e-commerce analysis, business circle analysis, primary and secondary market analysis, etc

· Market monitoring: e-commerce, news, housing supply monitoring, etc

· Business opportunity discovery: bidding intelligence discovery, customer information discovery, enterprise customer discovery, etc

To learn crawling, the first thing to understand is the web page. The polished pages we see are backed by HTML, CSS, JavaScript, and other source code.

The browser renders this source code into the pages we see. Because the source follows many regular rules, a crawler can follow the same rules to extract the information it needs, as the sketch below illustrates.
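To make this concrete, here is a minimal sketch that exploits the regular structure of HTML: it collects every link address from a made-up HTML snippet using Python's standard-library HTMLParser. The sample HTML and the LinkCollector class name are illustrative assumptions, not from the original article.

from html.parser import HTMLParser

sample_html = """
<html>
  <body>
    <a href="https://example.com/page1">Page 1</a>
    <a href="https://example.com/page2">Page 2</a>
  </body>
</html>
"""

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed(sample_html)
print(parser.links)  # ['https://example.com/page1', 'https://example.com/page2']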

The robots protocol sets the rules for crawlers: it tells crawlers and search engines which pages may be fetched and which may not.

This is usually a text file called robots.txt, placed at the root of the site.
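For example, a crawler can check robots.txt before fetching a page using Python's standard-library urllib.robotparser. This is only a sketch: the site address and the user-agent name below are placeholders, not from the original article.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # download and parse the robots.txt file

# Ask whether our crawler (identified by its User-Agent) may fetch a given URL.
if rp.can_fetch("MyCrawler", "https://example.com/some/page"):
    print("Allowed to crawl this page")
else:
    print("Disallowed by robots.txt")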