😀 This is the 2nd original crawler column


1.1 HTTP Fundamentals

In this section, we’ll take a closer look at the fundamentals of HTTP and see exactly what happens between typing a URL into the browser and retrieving the web content. Understanding this will help us grasp the basic principles of crawlers.

1. URI and URL

Let’s look at URIs and URLs first. URI stands for Uniform Resource Identifier, and URL stands for Uniform Resource Locator. For example, https://github.com/favicon.ico is both a URL and a URI: there is an icon resource whose access is uniquely specified by this URL/URI, including the access protocol (https), the access path (the root directory), and the resource name (favicon.ico). Through this link the resource can be found on the Internet, and such a link is called a URL/URI.

A URL is a subset of a URI, which means that every URL is a URI, but not every URI is a URL. So, what kind of URI is not a URL? URIs also include a subclass called URN, short for Uniform Resource Name. A URN names a resource without specifying how to locate it. For example, urn:isbn:0451450523 specifies a book’s ISBN, which uniquely identifies the book, but does not specify where to find a copy of it. The relationship between URLs, URNs, and URIs is shown in Figure 1-1.

However, URNs are rarely used on the Internet today; almost all URIs are URLs. So for ordinary web links, we can call them either URLs or URIs. I prefer to call them URLs.

A URL cannot be written arbitrarily, though. It must follow a certain format specification, whose basic composition is as follows:

scheme://[username:password@]hostname[:port][/path][;parameters][?query][#fragment]

The parts in square brackets are optional. For example, the URL https://www.baidu.com contains only the scheme and hostname parts; the port, path, parameters, query, and fragment parts are absent.

Let’s go through the meaning and function of each part:

  • scheme: the protocol. Common protocols include http, https, and ftp. scheme is also often called protocol; the two names refer to the same thing.
  • username and password: the user name and password. Some URLs can only be accessed with credentials, which are then placed before the hostname. For example, if https://ssr3.scrape.center requires the user name and password admin/admin, the URL can be written as https://admin:admin@ssr3.scrape.center.
  • hostname: the host address, which can be a domain name or an IP address. For example, the hostname of https://www.baidu.com is www.baidu.com, Baidu’s secondary domain name; in https://8.8.8.8, the hostname is 8.8.8.8, an IP address.
  • port: the service port specified by the server, such as 12345 in https://8.8.8.8:12345. Many URLs do not include a port because there are defaults: 80 for HTTP and 443 for HTTPS. So https://www.baidu.com is equivalent to https://www.baidu.com:443, and http://www.baidu.com is equivalent to http://www.baidu.com:80.
  • path: the location of the network resource on the server. For example, in https://github.com/favicon.ico the path is /favicon.ico, which refers to the resource favicon.ico in the root directory on GitHub.
  • parameters: additional information used when accessing a resource, such as user in https://8.8.8.8:12345/hello;user. Parameters are rarely used nowadays, and many people now call the query part (below) “parameters”, even using the two terms interchangeably. Strictly speaking, parameters are the part after the semicolon (;).
  • query: used to query certain kinds of resources; multiple queries are separated by &. Queries are actually quite common. For example, in www.baidu.com/s?wd=nba&ie… the query part is wd=nba&ie=utf-8, where wd has the value nba and ie has the value utf-8. Since query is used much more often than the parameters part above, it is frequently referred to as parameters or params; strictly speaking, though, it should be called query.
  • fragment: a partial supplement to the resource description, which can be understood as a bookmark inside the resource. It has two main uses today. One is single-page routing; modern front-end frameworks such as Vue and React can use it for route management. The other is as an HTML anchor, so that a page scrolls to a specific position when it opens.
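To make the anatomy above concrete, Python’s standard library can split a URL into exactly these components. A minimal sketch, using a made-up URL that exercises every part (the credentials, path, and query values here are hypothetical):

```python
from urllib.parse import urlparse

# A made-up URL containing every component discussed above.
url = "https://admin:admin@ssr3.scrape.center:443/path/index.html;user?id=5&name=python#comments"
parts = urlparse(url)

print(parts.scheme)    # scheme     -> https
print(parts.username)  # username   -> admin
print(parts.password)  # password   -> admin
print(parts.hostname)  # hostname   -> ssr3.scrape.center
print(parts.port)      # port       -> 443
print(parts.path)      # path       -> /path/index.html
print(parts.params)    # parameters -> user
print(parts.query)     # query      -> id=5&name=python
print(parts.fragment)  # fragment   -> comments
```

Note that `urlparse` uses the same terminology as the format specification above: the part after the semicolon lands in `params`, and the part after `?` in `query`.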

That gives us a brief overview of the basic concepts and composition of URLs; several practical examples later in the book will deepen this understanding.

2. HTTP and HTTPS

We have just seen the basic structure of URLs, whose scheme part supports many protocols, such as HTTP, HTTPS, FTP, SFTP, SMB, and so on.

In crawlers, the pages we capture are usually based on HTTP or HTTPS protocols. Here, we will first understand the meaning of these two protocols.

HTTP stands for Hypertext Transfer Protocol. It is used to transfer hypertext data from the network to the local browser, and it ensures that hypertext documents are transmitted efficiently and accurately. HTTP is a specification developed jointly by the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF). The most widely used version is HTTP/1.1, although many websites now also support HTTP/2.0.

Its development history is shown in the table below:

| Version | Released | Main features | Status |
| --- | --- | --- | --- |
| HTTP/0.9 | 1991 | No packet transfer; specified the communication format between client and server; GET requests only | Never a formal standard |
| HTTP/1.0 | 1996 | Added the PUT, PATCH, HEAD, OPTIONS, and DELETE methods | Formal standard |
| HTTP/1.1 | 1997 | Persistent (long-lived) connections, bandwidth savings, the Host field, pipelining, chunked transfer encoding | Formal standard, widely used |
| HTTP/2.0 | 2015 | Multiplexing, server push, header compression, binary protocol, etc. | Gradually gaining adoption |

HTTPS stands for Hypertext Transfer Protocol over Secure Socket Layer. It is the secure version of HTTP, that is, HTTP with an SSL layer added beneath it, HTTPS for short.

The security of HTTPS is based on SSL: everything transmitted over HTTPS is encrypted with SSL. HTTPS serves the following purposes:

  • Establish an information security channel to ensure the security of data transmission.
  • Verify the authenticity of the website. For sites that use HTTPS, you can click the lock icon in the browser’s address bar to view the verified information from the site’s certificate, or look it up via the security seal issued by the CA.

More and more websites and apps are now moving towards HTTPS, as illustrated below.

  • Apple required all iOS apps to use HTTPS encryption from January 1, 2017; apps that did not comply would not be listed in the App Store.
  • Starting with Chrome 56, launched in January 2017, Google shows a risk alert for pages that are not encrypted with HTTPS: a prominent warning in the address bar that the page is not secure.
  • The official requirements for Tencent’s WeChat mini programs state that backend network communication must use HTTPS; domains and protocols that do not meet the requirement cannot be requested.

Therefore, HTTPS has become the trend.

Note: HTTP and HTTPS are application-layer protocols in the computer-network stack, built on top of TCP. TCP is a transport-layer protocol, involving the three-way handshake when establishing a connection and the four-way handshake when closing it. However, since this book is about web crawlers, which mainly fetch content over HTTP/HTTPS, TCP, IP, and related topics are not covered further; interested readers can consult books such as Computer Networking and The Illustrated HTTP.

3. HTTP request process

We type a URL into the browser, press Enter and see the page content in the browser.

In effect, the browser sends a request to the server hosting the website. The server receives the request, processes and parses it, and returns the corresponding response, which travels back to the browser.

Since the response contains the source code and other content of the page, the browser parses it and presents the web page, as shown in Figure 1-3.

In this case, the client represents our own PC or mobile browser, and the server is the server where the website to be accessed resides.

To illustrate the process more concretely, here’s a demo using the Network panel of Chrome’s developer tools, which displays all the network requests and responses that occur while loading the current page.

Open Chrome, visit Baidu at www.baidu.com/, right-click and choose “Inspect” (or press the F12 shortcut) to open the browser’s developer tools, as shown below:

Switch to the Network panel and refresh the page. Many entries now appear at the bottom of the panel; each one represents a request being sent and a response being received, as shown in the figure below:

Let’s start with the first network request, www.baidu.com, where the columns have the following meanings.

  • The first column, Name: the name of the request, usually the last part of the URL.
  • The second column, Status: the status code of the response, displayed here as 200, indicating a normal response. From the status code we can tell whether the request received a normal response after being sent.
  • The third column, Protocol: the protocol of the request, where http/1.1 means HTTP 1.1 and h2 means HTTP 2.0.
  • The fourth column, Type: the type of document requested. Here it is document, meaning that what we requested this time is an HTML document whose content is HTML code.
  • The fifth column, Initiator: the request source, used to mark the object or process that initiated the request.
  • The sixth column, Size: the size of the resource downloaded from the server. If the resource was fetched from the cache, this column displays “from cache”.
  • The seventh column, Time: the total time from initiating the request to receiving the response.
  • The eighth column, Waterfall: a visual waterfall of the network request’s timing.

Click on the entry to see more details, as shown in the figure.

In the General section, Request URL is the URL of the request, Request Method is the request method, Status Code is the response status code, Remote Address is the address and port of the remote server, and Referrer Policy is the Referrer policy in effect.

Further down are Response Headers and Request Headers, which represent the response headers and request headers respectively. The request headers carry much information about the request, such as the browser identifier, Cookie, and Host, and form part of the request; the server uses them to judge whether the request is legitimate and then responds accordingly. The Response Headers shown in Figure 1-5 are part of the response and contain information such as the server type, document type, and date. After receiving the response, the browser parses it and renders the web content.

Let’s take a look at what’s involved in the request and response.

4. Request

A Request is sent by a client to a server. It consists of four parts: Request Method, Request URL, Request Headers, and Request Body.

Below we introduce them respectively.

Request Method

The request method identifies how the client asks the server for a resource. Two request methods are especially common: GET and POST.

Typing a URL into the browser and pressing Enter initiates a GET request, whose parameters are included directly in the URL. For example, searching for Python on Baidu is a GET request to www.baidu.com/s?wd=Python, where the URL contains the request’s query information and the parameter wd is the keyword to be searched. A POST request is mostly made when a form is submitted. For example, with a login form, clicking “Login” after entering a user name and password usually initiates a POST request, whose data is transferred as a form rather than reflected in the URL.

GET and POST request methods differ as follows:

  • The parameters of a GET request are contained in the URL, where the data is visible; the URL of a POST request does not carry the data, which is transmitted as a form in the request body.
  • A GET request submits at most 1024 bytes of data, while POST has no such limit.

Generally speaking, logging in requires submitting a user name and password, which are sensitive information. With GET, the password would be exposed in the URL and could leak, so it is better to send it via POST. POST is also used when uploading large files.
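The difference can be seen without any network traffic by constructing the two kinds of request with Python’s standard library (the login URL and the admin/admin credentials below are hypothetical; creating a `Request` object does not contact the server):

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters are encoded straight into the URL, where they are visible.
get_req = Request("https://www.baidu.com/s?" + urlencode({"wd": "Python"}))
print(get_req.get_method(), get_req.full_url)

# POST: the same kind of data travels in the request body, not the URL.
body = urlencode({"username": "admin", "password": "admin"}).encode("utf-8")
post_req = Request("https://example.com/login", data=body)  # data= switches the method to POST
print(post_req.get_method(), post_req.full_url)  # the URL carries no form data
print(post_req.data)
```

Note how supplying `data=` is all it takes for `urllib` to switch the method from GET to POST.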

The vast majority of requests we encounter in practice are GET or POST requests, but there are other request methods as well: GET, HEAD, POST, PUT, DELETE, CONNECT, OPTIONS, TRACE, and so on, briefly summarized in the following table.

| Method | Description |
| --- | --- |
| GET | Requests a page and returns its content |
| HEAD | Like GET, except that the response returned has no body; used to retrieve the headers |
| POST | Mostly used to submit forms or upload files; the data is contained in the request body |
| PUT | Replaces the specified document with data sent from the client |
| DELETE | Asks the server to delete the specified page |
| CONNECT | Uses the server as a relay to access other web pages on behalf of the client |
| OPTIONS | Lets the client query the options and capabilities the server supports |
| TRACE | Echoes the request received by the server, for testing or diagnosis |

Table adapted from: www.runoob.com/http/http-m… .

Request URL

The request URL (Request URL) uniquely identifies the resource we want to request. The composition of a URL and the function of each part were covered earlier, so they are not repeated here.

Request Headers

The request headers (Request Headers) carry additional information for the server, such as Cookie, Referer, and User-Agent.

Here is a brief description of some common headers:

  • Accept: a request header field that specifies what content types the client can accept.
  • Accept-Language: specifies the languages the client can accept.
  • Accept-Encoding: specifies the content encodings the client can accept.
  • Host: specifies the host and port of the server from which the resource is requested; its content is the location of the origin server or gateway for the requested URL. Since HTTP 1.1, requests must include this header.
  • Cookie (also commonly used in the plural, Cookies): data that a website stores locally to identify the user for session tracking. Its main function is to maintain the current session. For example, after we log in to a website with a user name and password, the server uses a session to record the login state; later, when we refresh or request other pages of the site, we find ourselves still logged in, thanks to cookies. The cookie carries information that identifies our session on the server. Every time the browser requests a page from the site, it adds the cookie to the request headers; the server identifies us from it, finds that we are logged in, and therefore returns content that can only be seen after logging in.
  • Referer: identifies the page from which the request was sent. The server can use this information for source statistics, hotlink protection, and so on.
  • User-Agent (UA for short): a special string header that lets the server identify the operating system and version, the browser and version, and other information about the client. Adding this header lets a crawler masquerade as a browser; without it, the crawler is very likely to be detected.
  • Content-Type: also called the Internet media type or MIME type. In an HTTP message it indicates the media type of the content. For example, text/html means an HTML document, image/gif a GIF image, and application/json JSON data. More mappings can be found at tool.oschina.net/commons.

Therefore, request headers are an important part of a request; in most cases, setting them is necessary when writing a crawler.
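As a sketch of setting request headers with the standard library (the header values here are illustrative; a real crawler would copy the full UA string of a current browser):

```python
from urllib.request import Request

# Headers that mimic a real browser, so the crawler is less likely
# to be rejected by the server.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://www.baidu.com/",
    "Accept-Language": "zh-CN,zh;q=0.9",
}
req = Request("https://www.baidu.com/", headers=headers)
print(req.header_items())  # the headers that will travel with the request
```

Note that `urllib` normalizes header names to capitalized form internally (e.g. `User-agent`); higher-level libraries such as requests take a `headers=` dict in the same way.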

Request Body

The request body (Request Body) typically carries the form data of a POST request; for a GET request, the request body is empty.

For example, the request and response captured when I log into GitHub is shown in Figure 1-6.

Before logging in, we fill in the user name and password, which are submitted to the server as form data. Note that Content-Type in the Request Headers is application/x-www-form-urlencoded: form data is submitted only when Content-Type is set to application/x-www-form-urlencoded. Alternatively, we can set Content-Type to application/json to submit JSON data, or to multipart/form-data to upload files.

The following table shows the relationship between Content-Type and the way POST data is submitted:

| Content-Type | How the data is submitted |
| --- | --- |
| application/x-www-form-urlencoded | Form data |
| multipart/form-data | Form file upload |
| application/json | Serialized JSON data |
| text/xml | XML data |

When constructing a POST request in a crawler, it is necessary to use the correct Content-Type, and to understand which Content-Type each request library uses for each way of setting parameters; otherwise the POST submission may not receive a normal response.
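The difference between the two most common POST body encodings can be sketched with the standard library (the payload below is made up):

```python
import json
from urllib.parse import urlencode

payload = {"name": "germey", "age": 25}

# Content-Type: application/x-www-form-urlencoded -> key=value pairs joined by "&"
form_body = urlencode(payload)
print(form_body)   # name=germey&age=25

# Content-Type: application/json -> the same data serialized as JSON text
json_body = json.dumps(payload)
print(json_body)   # {"name": "germey", "age": 25}
```

Same data, two different wire formats; the server decides which one it can parse based on the Content-Type header that accompanies the body.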

5. Response

A Response is returned by the server to the client. It can be divided into Response Status Code, Response Headers, and Response Body.

Response Status Code

The response status code (Response Status Code) indicates the server’s response status. For example, 200 means the server responded normally, 404 means the page was not found, and 500 means an internal server error occurred. In a crawler, we can judge the server’s response from the status code: if it is 200, the data was returned successfully and we proceed with further processing; otherwise the response may be ignored. The table below lists common status codes and their causes.

Common status codes and their causes

| Status code | Meaning | Details |
| --- | --- | --- |
| 100 | Continue | The requester should continue the request; the server has received part of it and is waiting for the rest |
| 101 | Switching Protocols | The requester asked the server to switch protocols, and the server has confirmed and is ready to switch |
| 200 | OK | The server successfully processed the request |
| 201 | Created | The request succeeded and the server created a new resource |
| 202 | Accepted | The server has accepted the request but has not yet processed it |
| 203 | Non-Authoritative Information | The server successfully processed the request, but the returned information may come from another source |
| 204 | No Content | The server successfully processed the request but returned no content |
| 205 | Reset Content | The server successfully processed the request and the content was reset |
| 206 | Partial Content | The server successfully processed part of the request |
| 300 | Multiple Choices | The server can perform a variety of actions for the request |
| 301 | Moved Permanently | The requested page has been permanently moved to a new location; a permanent redirect |
| 302 | Found | The requested page is temporarily redirected to another page |
| 303 | See Other | If the original request was POST, the redirect target should be retrieved via GET |
| 304 | Not Modified | The page has not been modified since the last request; the cached resource continues to be used |
| 305 | Use Proxy | The requester should access the page through a proxy |
| 307 | Temporary Redirect | The requested resource temporarily responds from another location |
| 400 | Bad Request | The server could not parse the request |
| 401 | Unauthorized | The request was not authenticated or authentication failed |
| 403 | Forbidden | The server refused the request |
| 404 | Not Found | The server could not find the requested page |
| 405 | Method Not Allowed | The server has disabled the method specified in the request |
| 406 | Not Acceptable | The requested page cannot respond with the requested content characteristics |
| 407 | Proxy Authentication Required | The requester must first authenticate with the proxy |
| 408 | Request Timeout | The server timed out waiting for the request |
| 409 | Conflict | A conflict occurred on the server while completing the request |
| 410 | Gone | The requested resource has been permanently removed |
| 411 | Length Required | The server rejects requests that lack a valid Content-Length header field |
| 412 | Precondition Failed | The server did not meet one of the preconditions the requester set in the request |
| 413 | Payload Too Large | The request entity is too large for the server to handle |
| 414 | URI Too Long | The requested URI is too long for the server to process |
| 415 | Unsupported Media Type | The requested page does not support the request’s format |
| 416 | Range Not Satisfiable | The page cannot provide the requested range |
| 417 | Expectation Failed | The server did not meet the requirement of the Expect request header field |
| 500 | Internal Server Error | The server encountered an error and could not complete the request |
| 501 | Not Implemented | The server lacks the capability to complete the request |
| 502 | Bad Gateway | The server, acting as a gateway or proxy, received an invalid response from the upstream server |
| 503 | Service Unavailable | The server is currently unavailable |
| 504 | Gateway Timeout | The server, acting as a gateway or proxy, did not receive a timely response from the upstream server |
| 505 | HTTP Version Not Supported | The server does not support the HTTP version used in the request |
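Python’s standard library knows these codes, so a crawler can turn a raw status code into its standard reason phrase. A small sketch, with the 200-check mirroring the crawler logic described above:

```python
from http import HTTPStatus

def should_parse(status_code: int) -> bool:
    """Only responses with status 200 are handed on to the parser."""
    return status_code == HTTPStatus.OK

# Map a few raw status codes to their standard reason phrases.
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", "parse" if should_parse(code) else "skip")
```

This prints `200 OK -> parse`, `404 Not Found -> skip`, `500 Internal Server Error -> skip`.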

Response Headers

The response headers contain the server’s reply information to the request, such as Content-Type, Server, and Set-Cookie. Some common response headers are briefly described below.

  • Date: the time at which the response was generated.
  • Last-Modified: the time the resource was last modified.
  • Content-Encoding: the encoding of the response content.
  • Server: information about the server, such as its name and version number.
  • Content-Type: the type of the returned data. For example, text/html means an HTML document is returned, application/x-javascript a JavaScript file, and image/jpeg an image.
  • Set-Cookie: sets a cookie. Set-Cookie in the response headers tells the browser to store this content in a cookie and to carry the cookie on the next request.
  • Expires: specifies an expiry time for the response, which lets a proxy server or the browser keep the loaded content in its cache. On a repeat visit, the content can be loaded directly from the cache, reducing server load and shortening load time.

Response Body

The response body (Response Body) is arguably the most critical part: the body data of the response lives here. When a web page is requested, its response body is the page’s HTML code; when an image is requested, its response body is the binary data of the image. After a crawler requests a page, the content to be parsed is the response body, as shown in Figure 1-7.

Click Preview in the browser developer tools, and you’ll see the source code of the page, i.e. the contents of the response body, which is the target of parsing.

When writing a crawler, we mainly obtain the page’s source code or JSON data through the response body, and then extract the desired content from it.
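A minimal sketch of that parsing step, using a made-up JSON response body such as an API might return:

```python
import json

# A hypothetical JSON response body, already decoded to text.
response_body = '{"status": "ok", "data": [{"title": "Python tutorial"}]}'

data = json.loads(response_body)                     # parse the response body
titles = [item["title"] for item in data["data"]]    # extract the desired content
print(titles)   # ['Python tutorial']
```

For an HTML response body, the same step would use an HTML parser instead of `json.loads`; the overall pattern of "fetch body, then extract" is identical.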

In this section, we’ve looked at the basics of HTTP, and roughly the request and response process behind accessing a web page. This section covers a lot of information that needs to be understood and is often used when analyzing web requests.

6. HTTP / 2.0

As mentioned earlier, HTTP/2.0 was released in 2015. Compared with HTTP/1.1, HTTP/2.0 is faster, simpler, and more stable: it makes many optimizations at the transport level. Its main goals are to reduce latency by enabling full request and response multiplexing, to minimize protocol overhead by compressing HTTP header fields efficiently, and to add support for request prioritization and server push. These optimizations eliminate a whole series of workarounds that had grown up around HTTP/1.1 for transport optimization.

Some readers may ask at this point: why HTTP/2.0 and not HTTP/1.2? Because HTTP/2.0 introduces a new binary framing layer internally that is not backward compatible with earlier HTTP/1.x servers and clients, the major version number was bumped to 2.0.

Let’s take a look at some of the improvements HTTP/2.0 has made over HTTP/1.1.

Binary framing layer

At the heart of all of HTTP/2.0’s performance enhancements is this new binary framing layer. In HTTP/1.x, both requests and responses are transmitted as text, with headers and body separated by text newlines. HTTP/2.0 optimizes this by changing the text format to a binary format, which makes parsing more efficient: request and response data are split into smaller frames and encoded in binary.

So here are a few new concepts:

  • Frame: a concept that exists only in HTTP/2.0; the smallest unit of data communication. For example, a request is divided into a request headers frame and a request data frame.
  • Data stream: a virtual channel that can carry bidirectional messages; each stream is identified by a unique integer ID.
  • Message: a complete sequence of frames corresponding to a logical request or response message.

In HTTP/2.0, all communication under a domain name can be done over a single connection that can host any number of two-way data streams. Data streams are used to host two-way messages, each of which is a logical HTTP message (such as a request or response) that can contain one or more frames.

In short, HTTP/2.0 breaks down HTTP protocol communication into an exchange of binary-encoded frames that correspond to messages in a particular data stream, all of which are multiplexed within a TCP connection, which is the basis for all other features and performance optimizations of the HTTP/2.0 protocol.

Multiplexing

In HTTP/1.x, a client that wants to make multiple parallel requests to improve performance must use multiple TCP connections, and browsers limit a single domain to 6-8 TCP connections. In HTTP/2.0, however, thanks to the binary framing layer, multiple TCP connections are no longer needed to achieve multiplexing: clients and servers can break HTTP messages into independent frames, send them interleaved, and reassemble them at the other end. This allows us to:

  • Multiple requests are sent in parallel and interleaved, without affecting each other.
  • Multiple responses are sent interleaved in parallel and do not interfere with each other.
  • Use a single connection to send multiple requests and responses in parallel.
  • You don’t have to do much work to get around HTTP/1.x restrictions anymore.
  • Reduce page load times by eliminating unnecessary latency and increasing utilization of existing network capacity.

Thus, the performance of the overall data transmission is greatly improved:

  • The same domain name occupies only one TCP connection; multiple requests and responses are sent in parallel over that one connection, eliminating the latency and memory cost of multiple TCP connections.
  • Multiple requests and responses are sent interleaved in parallel without affecting each other.
  • In HTTP/2.0, each request can carry a 31-bit priority value; 0 indicates the highest priority, and larger values indicate lower priority. With these priorities, clients and servers can apply different strategies to different streams, so as to send streams, messages, and frames in an optimal way.

Flow control

Flow control is a mechanism that prevents a sender from overwhelming a receiver with more data than it needs or can process. Think of a receiver that is too busy to handle incoming messages while the sender keeps sending large amounts of them; this causes problems.

For example, a client might request a large video stream with high priority, but the user has paused the video; the client then wants to pause or throttle the transfer from the server to avoid fetching and buffering data it does not need. Likewise, a proxy server with a fast downstream connection and a slow upstream one may want to throttle the downstream to match the upstream and so control its resource usage, and so on.

HTTP runs over TCP, and TCP has a native flow control mechanism. But because HTTP/2.0 data streams are multiplexed within a single TCP connection, TCP flow control is neither fine-grained enough nor does it provide the application-level APIs needed to regulate the transmission of individual data streams.

To address this issue, HTTP/2.0 provides a simple set of building blocks that allow clients and servers to implement their own data flow and connection-level flow control:

  • Flow control is directional. Each receiver can choose to set any window size it needs for each data stream and for the connection as a whole.
  • Flow control is credit-based. Each receiver advertises its initial connection-level and stream-level flow control windows (in bytes); the window shrinks whenever the sender emits a DATA frame and grows when the receiver sends a WINDOW_UPDATE frame.
  • Flow control cannot be disabled. Once an HTTP/2.0 connection is established, the client and server exchange SETTINGS frames, which set the flow control windows in both directions. The default flow control window is 65535 bytes, but the receiver can set a larger maximum window size (up to 2^31 − 1 bytes) and maintain it by sending WINDOW_UPDATE frames whenever data is received.
  • Flow control is hop-by-hop, not end-to-end. That is, a trusted intermediary can use it to control resource usage and implement allocation mechanisms based on its own conditions and heuristics.

Thus, HTTP/2.0 provides simple building blocks for implementing custom policies to regulate resource usage and allocation, as well as enabling new transport capabilities, while improving the real and perceived performance of web applications.

Server push

Another powerful new feature added to HTTP/2.0 is that the server can send multiple responses to a single client request. In other words, in addition to the response to the initial request, the server can push additional resources to the client without the client explicitly requesting them.

When the client is certain to need certain resources, server push can be used: after the client initiates a request, the server pushes the necessary resources in advance, reducing latency. For example, the server can proactively push JS and CSS files to the client, so the client does not have to send those requests while parsing the HTML.

Of course, while the server can push proactively, the client retains the right to choose whether to accept. If the server pushes a resource that the browser has already cached, the browser can decline it by sending RST_STREAM.

In addition, pushed resources must obey the same-origin policy: the server cannot push arbitrary third-party resources to the client, only content both parties have confirmed, which provides a degree of security.

Current state of HTTP/2.0 development

HTTP/2.0 still has a long way to go before it is ubiquitous. Some mainstream websites already support it, and mainstream browsers have implemented support, but on the whole, most websites are still based on HTTP/1.1.

Support in programming-language libraries is also incomplete. In Python, for example, hyper and HTTPX support HTTP/2.0, but the widely used requests library still supports only HTTP/1.1.

7. Summary

This section introduced quite a lot of basic knowledge about HTTP, which needs to be mastered: it will be a great help later when we write and analyze web crawlers.

Since most of this section is conceptual, it draws on many books, documents, and blogs, including the following sources:

  • Book – HTTP: The Definitive Guide, by David Gourley and Brian Totty
  • Documentation – HTTP – Wikipedia: en.wikipedia.org/wiki/Hypert…
  • Documentation – HTTP – Baidu Baike: baike.baidu.com/item/HTTP/2…
  • Documentation – HTTP – MDN Web Docs: developer.mozilla.org/en-US/docs/…
  • Documentation – Introduction to HTTP/2 – Google developer docs: developers.google.com/web/fundame…
  • Blog – HTTP/2 and HTTP/3 features explained: blog.fundebug.com/2019/03/07/…
  • Blog – HTTP/2 features explained: zhuanlan.zhihu.com/p/26559480

For more exciting content, please pay attention to my public account “Attack Coder” and “Cui Qingcai | Jingmi”.