Have ideals, have ambition, stay self-disciplined, and believe that you will succeed in the near future!

Open WeChat, search for [children go to school], and follow this programmer who is a little different.

In this [Python crawler] series, I plan to take you from 0 to 1, from crawler beginner to crawler expert, and there will be plenty of exciting content. Follow me and learn crawlers with me!

  • Summary of crawler characteristics
  • The concept of a crawler
  • The role of crawlers
  • Classification of crawlers
    • By the number of websites crawled, crawlers can be divided into:
    • By whether the purpose is to obtain data, crawlers can be divided into:
    • By whether the URL address and the corresponding page content change, incremental data crawlers can be divided into:
  • The crawler process
  • Concepts and differences between HTTP and HTTPS
  • Request headers the crawler pays special attention to
  • Response headers the crawler pays special attention to
  • Common response status codes
  • The process of an HTTP request
  • Points to note
  • Final words

Python Crawlers: What is a Crawler?

See that spider up there? Don’t get me wrong. Today we’re going to teach you how to play with that spider, as we officially start learning the Python crawler, from 0 to 1, with ease…

Summary of crawler characteristics

  • Knowledge fragmentation

Crawler knowledge is very fragmented. When we write crawlers we face all kinds of websites; each site’s implementation technology is similar, but most of the time there are still differences, which requires us to use different technical means for different websites. Writing a crawler is not like learning web development, where you can implement a feature just by following a fixed routine.

  • Learning difficulty

Getting started with crawlers is simpler than getting started with web development, but in the later stages crawlers become harder than the web. The difficulty lies in the contest between crawler engineers and site operators: you write a crawler for a website, the site’s operators add anti-crawling measures, and as a crawler engineer you have to defeat those measures.

  • Learning characteristics

Learning crawlers is not like learning the Web, where there is a complete project to practice on. Because of the nature of crawlers, you learn with individual websites as the objects: think of it as one technical point plus one case study.

The concept of a crawler

Simulate a browser, send a request, get a response

A web crawler (also known as a web spider or web robot) is a program that simulates a client (mainly a browser) to send requests and receive responses, automatically grabbing information from the Internet according to certain rules.

  • In principle, anything a browser can do, a crawler can do
  • The crawler can only retrieve the data presented by the browser

Enter the Baidu URL in the browser, open the developer tools, click “Network”, then refresh the page to capture packets.


Understand the concept of a crawler

The role of crawlers

The role of crawlers on the Internet

  • Data collection
  • Software testing
  • 12306 ticket grabbing
  • Website voting
  • Network security

Classification of crawlers

By the number of websites crawled, crawlers can be divided into:

  • General-purpose crawlers, such as search engines
  • Focused crawlers, such as 12306 ticket grabbing, or crawlers for a specific type of data on a website

By whether the purpose is to obtain data, crawlers can be divided into:

  • Functional crawlers, e.g. voting or giving your favorite star a thumbs-up
  • Incremental data crawlers, such as job listings

By whether the URL address and the corresponding page content change, incremental data crawlers can be divided into:

  • Incremental crawlers where both the URL address and the content change
  • Incremental crawlers where the URL address stays the same but the content changes

The crawler classification

Understand the classification of crawlers


The crawler process

1. Get a URL

2. Send a request to the URL and get the response (HTTP protocol)

3. If there are URLs to extract from the response, send requests to them and get their responses

4. If there is data to extract from the response, save the data
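The four steps above can be sketched as a small loop. This is a minimal illustration, not production code; the `fetch` function is injected so the sketch runs without network access (in practice it might be something like `lambda url: requests.get(url).text`):

```python
import re
from collections import deque

def extract_links(html):
    """Step 3: pull absolute URLs out of a response body (naive regex sketch)."""
    return re.findall(r'href="(https?://[^"]+)"', html)

def crawl(start_url, fetch, max_pages=10):
    """Steps 1-4 as a loop. `fetch` is injected so the sketch needs no network."""
    queue, seen, saved = deque([start_url]), set(), {}
    while queue and len(saved) < max_pages:
        url = queue.popleft()              # step 1: take a URL
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)                  # step 2: send request, get response
        saved[url] = html                  # step 4: save the data
        queue.extend(extract_links(html))  # step 3: queue extracted URLs
    return saved

# Tiny in-memory "website" standing in for real responses:
pages = {
    "https://example.com/": '<a href="https://example.com/a">next</a>',
    "https://example.com/a": "plain text, no links",
}
result = crawl("https://example.com/", fetch=lambda u: pages.get(u, ""))
print(sorted(result))  # both pages were fetched and saved
```

A real crawler would add politeness delays, error handling, and normalization of relative links, but the request/extract/save loop stays the same.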


Grasp the crawler process


Concepts and differences between HTTP and HTTPS

In the second step of the crawler process, sending a request to the URL relies on the HTTP/HTTPS protocol.

HTTPS is more secure than HTTP, but has lower performance

  • HTTP: HyperText Transfer Protocol. The default port is 80

    . Hypertext: hypertext is not limited to text; it can also carry data such as pictures, video, and audio

    . Transfer protocol: the hypertext content, converted to a string, is transmitted in a fixed format agreed on by both sides

  • HTTPS: HTTP + SSL (Secure Sockets Layer), that is, the hypertext transfer protocol with a secure sockets layer. The default port is 443

    . SSL encrypts the transmitted content (the hypertext, i.e. the request and response bodies)

  • You can open a browser, visit a URL, right-click and choose Inspect, click Network, select a request, and see the form of the HTTP protocol.
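The default ports can be demonstrated with a few lines of Python’s standard library; `effective_port` is a helper name made up for this sketch:

```python
from urllib.parse import urlsplit

def effective_port(url):
    """Return the port a request to this URL would use.
    Default ports: 80 for http, 443 for https (used when the URL names none)."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return {"http": 80, "https": 443}[parts.scheme]

print(effective_port("http://example.com/index.html"))   # 80
print(effective_port("https://example.com/index.html"))  # 443
print(effective_port("http://example.com:8080/"))        # 8080 (explicit port wins)
```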


Understand the concepts and default ports of HTTP and HTTPS


Request headers the crawler pays special attention to

Request headers and response headers

In the HTTP request shown above, the crawler pays special attention to the following request header fields

  • Content-Type
  • Host
  • Connection
  • Upgrade-Insecure-Requests (upgrade to HTTPS requests)
  • User-Agent (user agent)
  • Referer
  • Cookie (maintains user state)
  • Authorization (authorization information)

For example, capture packets while visiting Baidu in the browser

If you click “View source”, the request headers appear in a different format: that is the raw version. Without clicking “View source”, the header format is the one prettified by the browser.
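As a sketch of how a crawler sets these fields, here is a hypothetical headers dictionary in the shape that `requests` or `urllib` expects; the values are illustrative, not required:

```python
import urllib.request

# Hypothetical request headers a crawler commonly sets (values are examples):
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # pretend to be a browser
    "Referer": "https://www.baidu.com/",   # the page we claim to have come from
    "Cookie": "BAIDUID=example",           # maintains user state
}

# With the requests library this would be sent as:
#   resp = requests.get("https://www.baidu.com/", headers=headers)
# Building a urllib Request shows the headers without sending anything:
req = urllib.request.Request("https://www.baidu.com/", headers=headers)
print(req.get_header("User-agent"))
```

Header names are case-insensitive on the wire, which is why `urllib` normalizes them internally.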

Response headers the crawler pays special attention to

  • Set-Cookie

Cookies are generated by the server. When the client sends a request to the server for the first time, the server generates a cookie and returns it to the client, which stores it. On subsequent requests, the client sends the cookie back to the server.
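This round trip can be simulated with the standard library: parse a hypothetical Set-Cookie header the way the client would store it, then build the Cookie header the client sends back on the next request:

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header from the server's first response:
raw = "sessionid=abc123; Path=/; HttpOnly"
jar = SimpleCookie()
jar.load(raw)                      # client parses and stores the cookie

# On the next request, the client echoes name=value back in a Cookie header:
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print(cookie_header)  # sessionid=abc123
```

In practice, `requests.Session()` does this bookkeeping automatically across requests.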

Common response status codes

  • 200: success
  • 302: temporary redirect; the new URL is given in the Location header of the response
  • 303: see other; the browser redirects to the new URL with a GET request (typically after a POST)
  • 307: temporary redirect; the browser re-sends the request to the new URL with the original method
  • 403: forbidden; the server understands the client’s request but refuses to process it (no permission)
  • 404: page not found
  • 500: internal server error
  • 503: the server is unavailable due to maintenance or overload, and the response may carry a Retry-After header. This can also happen when a crawler accesses a URL so frequently that the server rejects the crawler’s requests and returns a 503 status code

No status code can be fully trusted; always judge by the data obtained in the captured response.

The source code captured in the Network panel is the basis for judgment. The source in the Elements panel is the rendered source and cannot be used as the criterion.
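One way a crawler might act on these codes is sketched below; `handle_status` is a made-up helper, and as noted above, real code should also inspect the response body rather than trust the code alone:

```python
def handle_status(status, headers):
    """Sketch of how a crawler might react to common status codes
    (a hypothetical helper; real code should also inspect the response body)."""
    if status == 200:
        return "parse body"
    if status in (302, 303, 307):
        return f"follow redirect to {headers.get('Location')}"
    if status == 503:
        return f"back off, retry after {headers.get('Retry-After', 'a while')}"
    if status in (403, 404, 500):
        return "give up on this URL"
    return "inspect manually"

print(handle_status(302, {"Location": "https://example.com/new"}))
```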


Understand common response status codes


The process of an HTTP request

1. After resolving the domain name to an IP address, the browser sends a request to the URL in the address bar and gets a response.

2. The returned response content (HTML) contains further URLs: CSS, JS, images, Ajax code, and so on. The browser sends these other requests in the order they appear in the response content and gets their responses.

3. Each time the browser gets a response, it adds (loads) it into the displayed result; JS, CSS, and other content may modify the page content.

The whole process, from getting the first response and displaying it in the browser to finally getting all the responses and adding their content or modifications to the displayed result, is called browser rendering.

Points to note

A crawler only requests the URL address and gets the response corresponding to that URL address (the response may be HTML, CSS, JS, pictures, videos, etc.).

In many cases, the page rendered by the browser and the page requested by the crawler are different, because the crawler has no rendering capability.

  • The result the browser presents is multiple requests and responses rendered together
  • The crawler makes only one request to a URL address and gets one response

Understand that what the browser displays can be the result of multiple requests and responses rendered together, whereas a crawler gets one response per request.


Final words

I see no ending, yet high and low I will search!

I am Reader, a dedicated learner. The more you know, the more you realize you don’t know.

See you next time for more highlights!