We often picture the Internet as a big spider web, and a web crawler is the spider that moves across it. The nodes of this web are web pages: when the crawler reaches a node, it visits the page and obtains its information.
The connections between nodes are the links between web pages, so after passing through one node the spider can continue along several strands to the next ones. In other words, by following one page it can keep reaching the pages behind it, until every node of the network has been visited and the site's data has been captured. So what exactly do we mean by a crawler?

Simply put, a crawler is an automated program that fetches web pages and then extracts and saves their information. Below, Rhino Proxy gives everyone a brief introduction.

(1) Get web pages

The first step of a crawler's work is to get the web page, that is, to get the page's source code. The source code contains part of the page's useful information, so once you have obtained the source code, you can extract the information you want from it.
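As a minimal sketch (assuming the requests library is installed and using https://example.com as a placeholder URL), getting a page's source code could look like this:

import requests

# Placeholder URL; replace with the page you actually want to crawl.
url = "https://example.com"

# Send an HTTP GET request and read the returned source code.
response = requests.get(url, timeout=10)
response.raise_for_status()   # raise an error for 4xx/5xx responses
html = response.text          # the HTML source code as a string

print(html[:200])             # show the first 200 characters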

(2) Extracting information

After obtaining the source code of the web page, the next step is to analyze it and extract the data we want. The most common approach is regular expression extraction, a general-purpose method, but it has drawbacks: regular expressions are relatively complex to construct and error-prone.

In addition, because web pages have their own regular structure, there are libraries such as Beautiful Soup, PyQuery, and lxml that extract information based on page node attributes, CSS selectors, or XPath. Using these libraries, we can extract web page information, such as node attributes and text values, quickly and efficiently.

Extracting information is a very important part of crawling: it turns disorganized data into something well structured, which makes it much easier for us to process and analyze the data later.
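As a minimal sketch (assuming the beautifulsoup4 package is installed, with a small hypothetical HTML snippet standing in for the source code fetched earlier), extracting the title and the links with Beautiful Soup could look like this:

from bs4 import BeautifulSoup

# A small HTML snippet standing in for the source code fetched earlier.
html = '<html><head><title>Demo</title></head><body><a href="/page1">Link 1</a></body></html>'

soup = BeautifulSoup(html, "html.parser")

# Extract the page title text.
title = soup.title.string

# Extract the text and href attribute of every link node.
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

print(title)   # Demo
print(links)   # [('Link 1', '/page1')]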

(3) Save data

After extracting the information, we usually save the data to disk or to some other location of our choice for later use. There are many ways to save it: the simplest is to save it as a text or JSON file; it can also be saved to a database, such as MySQL or MongoDB, or to a remote server, for example by operating over SFTP.
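As a minimal sketch (the items list below is a hypothetical result of the previous steps), saving the extracted data as a JSON file, and optionally writing it to MongoDB with pymongo, could look like this:

import json

# Hypothetical data extracted in the previous steps.
items = [{"title": "Example", "url": "https://example.com"}]

# Save as a JSON text file on disk.
with open("result.json", "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)

# Alternatively, save to MongoDB (assumes pymongo is installed
# and a MongoDB server is running locally).
from pymongo import MongoClient
client = MongoClient("localhost", 27017)
client["crawler"]["items"].insert_many(items)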

(4) Automated program

An automated program means that a crawler can carry out these operations in place of a human. When the amount of information is very small, we can collect it by hand, but when there is a large amount of information, or we want to speed up acquiring large amounts of data, we have to rely on the power of a program. A crawler is an automated program that does this work for us; during crawling it can perform various kinds of exception handling, error retries, and other operations to ensure that the crawl keeps running efficiently.
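As a minimal sketch of such exception handling and retries (assuming the requests library, with https://example.com as a placeholder URL), a fetch function might retry a few times before giving up:

import time
import requests

def fetch_with_retry(url, retries=3, delay=2):
    # Retry the request a few times so that a transient network
    # error does not stop the whole crawl.
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed: {exc}")
            time.sleep(delay)
    return None

html = fetch_with_retry("https://example.com")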

2. What kind of data can be captured

We can see all kinds of information on web pages. The most common are regular web pages, which correspond to HTML code, and HTML source code is also what is most commonly grabbed; for example, the result page we get when searching for "rhino proxy" corresponds to HTML source code.
In addition, some web pages return not HTML code but a JSON string (API interfaces mostly take this form). Data in this format is very convenient to transmit and parse; it can also be grabbed, and extracting the data is even easier.
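As a minimal sketch (https://example.com/api/items is a hypothetical endpoint), grabbing and parsing a JSON response with requests could look like this:

import requests

# Hypothetical API endpoint that returns a JSON string instead of HTML.
api_url = "https://example.com/api/items"

response = requests.get(api_url, timeout=10)
data = response.json()   # parse the JSON string into Python objects
print(data)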

In addition, we also encounter all kinds of binary data, such as videos, pictures, and audio. With a crawler, we can grab this binary data and save it under a corresponding file name. We can also see files with various extensions, such as CSS, JavaScript, and configuration files; these are among the most common files, and as long as the browser can access them, a crawler can grab them.
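As a minimal sketch (https://example.com/logo.png is a placeholder URL), downloading an image as binary data and saving it to a file could look like this:

import requests

# Placeholder image URL; any binary resource (image, audio, video)
# is handled the same way.
image_url = "https://example.com/logo.png"

response = requests.get(image_url, timeout=10)

# Write the raw bytes to a file with the corresponding file name.
with open("logo.png", "wb") as f:
    f.write(response.content)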

All of the above actually correspond to their own URLs and are delivered over HTTP or HTTPS; as long as data can be accessed this way, a crawler can capture it.

3. JavaScript-rendered pages

Sometimes when we crawl a web page with urllib or requests, the source code we get back is different from what we see in the browser. This is a very common problem. Web pages are increasingly built with Ajax and front-end modularization tools, and the entire page may be rendered by JavaScript, which means the original HTML code is just an empty shell, for example:

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>This is a Demo</title>
</head>
<body>
    <div id="container">
    </div>
</body>
<script src="app.js"></script>
</html>

Note that the body contains only one node whose id is container, but after the body node an app.js file is introduced, and it is responsible for rendering the entire site.

When you open the page in a browser, the HTML content is loaded first, and the browser notices that an app.js file has been introduced; it then requests that file, gets it, and executes the JavaScript code in it, which modifies the nodes in the HTML and adds content to them, finally producing the complete page.

But when we request the page with a basic HTTP request library such as urllib or requests, we get only this HTML shell, so the source code we obtain may look nothing like the page source shown in the browser. In such cases, we can analyze the back-end Ajax interface, or use libraries such as Selenium and Splash to simulate JavaScript rendering.
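As a minimal sketch of the Selenium approach (assuming the selenium package and a Chrome driver are installed, with https://example.com as a placeholder URL), we can let a real browser execute the JavaScript and then read the rendered source:

from selenium import webdriver

# Launch a browser, let it execute the page's JavaScript,
# and then read the fully rendered source code.
driver = webdriver.Chrome()
driver.get("https://example.com")    # placeholder URL
rendered_html = driver.page_source   # HTML after JavaScript rendering
driver.quit()

print(rendered_html[:200])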