HTTP Protocol Overview

HyperText Transfer Protocol (HTTP) is an application-layer protocol for distributed, collaborative, hypermedia information systems. HTTP is the basis for data communication on the World Wide Web.

The development of HTTP was initiated by Tim Berners-Lee in 1989 at CERN, the European Organization for Nuclear Research. The development of HTTP standards was coordinated by the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF), which eventually published a series of RFCs. The best known of these is RFC 2616, published in June 1999, which defined HTTP/1.1, a version of the protocol that is still widely used today.

In December 2014, the Hypertext Transfer Protocol Bis (HTTPBIS) Working Group of the IETF submitted the HTTP/2 standard proposal to the IESG for discussion, and it was approved on 17 February 2015. The HTTP/2 standard was officially published as RFC 7540 in May 2015 as the successor to HTTP/1.1, which nevertheless remains in wide use.

HTTP Protocol Overview

HTTP is a standard for requests and responses between a client (the user side) and a server (the web site), normally carried over TCP. Using a Web browser, web crawler, or other tool, the client makes an HTTP request to a specified port on the server (the default port is 80). We call this client the User Agent. The answering server stores resources such as HTML files and images. We call this answering server the Origin Server. There may be multiple intermediaries between the user agent and the origin server, such as proxy servers, gateways, or tunnels.

Although TCP/IP is the most popular protocol suite on the Internet, HTTP is not required to use it or the layers it provides. In fact, HTTP can be implemented over any other Internet protocol or any other network. HTTP only assumes that the underlying protocol provides reliable transport, so any protocol that provides such a guarantee can be used. Within the TCP/IP protocol family, HTTP therefore uses TCP as its transport layer.

Typically, an HTTP client initiates a request by creating a TCP connection to a specified port on the server (default: port 80). The HTTP server listens for client requests on that port. Once a request is received, the server returns a status line to the client, such as “HTTP/1.1 200 OK”, along with the response content, such as the requested file, an error message, or other information.

How HTTP works

The HTTP protocol defines how a Web client requests a Web page from a Web server and how the server delivers the Web page to the client. HTTP uses a request/response model. The client sends the server a request message containing the request method, URL, protocol version, request headers, and request data. The server responds with a status line containing the protocol version and a success or error code, followed by server information, response headers, and response data.
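The request and response layout described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (HTTP/1.1 text messages, no chunked encoding); the function names build_request and parse_response are hypothetical helpers, not part of any library.

```python
def build_request(method, path, host, headers=None, body=b""):
    """Assemble a request: request line, headers, blank line, optional body."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        lines.append(f"Content-Length: {len(body)}")
    # A blank line (CRLF CRLF) separates the headers from the body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


def parse_response(raw):
    """Split a raw response into status-line parts, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body
```

A request built with build_request("GET", "/index.html", "example.com") contains only the request line, the Host header, and the terminating blank line.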

Here are the steps for an HTTP request/response:

1. The client connects to the Web server. An HTTP client, usually a browser, establishes a TCP socket connection with the HTTP port (80 by default) on the Web server.

2. The client sends an HTTP request. Through the TCP socket, the client sends a text request message to the Web server. A request message consists of the request line, request headers, a blank line, and request data.

3. The server accepts the request and returns an HTTP response. The Web server parses the request and locates the requested resource. The server writes a copy of the resource to the TCP socket, where it is read by the client. A response consists of a status line, response headers, a blank line, and response data.

4. The TCP connection is released. If the Connection mode is close, the server actively closes the TCP connection and the client passively closes its side, which releases the TCP connection. If the Connection mode is keep-alive, the connection is kept open for a period of time, during which further requests can be received.

5. The client browser parses the HTML content. The client browser first parses the status line to read the status code, which indicates whether the request was successful. It then parses each response header; the response headers tell the browser how many bytes of HTML document follow and which character set the document uses. Finally, the browser reads the HTML in the response data, formats it according to HTML syntax, and displays it in the browser window.
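The five steps above can be traced with a raw socket. The sketch below is a minimal illustration, not how a real browser works: it starts a throwaway server on the loopback interface (HelloHandler, fetch, and demo are hypothetical names) so that the example does not depend on any external site.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with one small HTML page."""

    def do_GET(self):
        body = b"<h1>hello</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Connection", "close")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def fetch(host, port, path="/"):
    # Step 1: open a TCP connection to the server's HTTP port.
    with socket.create_connection((host, port)) as sock:
        # Step 2: send the request line, headers, and the blank line.
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        # Step 3: read the response until the server closes the connection.
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    # Step 4: leaving the with-block released the client's side of the connection.
    raw = b"".join(chunks)
    # Step 5: separate the status line and headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("ascii")
    return status_line, body


def demo():
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch("127.0.0.1", server.server_port)
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns the status line and the HTML body; a browser would go on to parse the headers and render the HTML.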

For example, when you enter a URL in the browser address bar and press Enter, the following process occurs:

  1. The browser asks a DNS server to resolve the domain name in the URL to an IP address.
  2. Once the IP address is resolved, the browser establishes a TCP connection with the server using that IP address and the default port 80.
  3. The browser sends an HTTP request to read a file (the file named after the domain name in the URL). The request can be sent together with the third segment of the TCP three-way handshake.
  4. The server responds to the browser’s request and sends the corresponding HTML text to the browser.
  5. The TCP connection is released.
  6. The browser renders the HTML text and displays the content.

HTTP is an application-layer protocol based on TCP/IP.

Request-response based pattern

According to the HTTP protocol, a request is always made by the client, and the server responds to it. In other words, communication is initiated by the client, and the server does not send a response until it has received a request.

Stateless

HTTP is a stateless protocol that does not save state. The HTTP protocol itself does not store the state of communication between requests and responses. That is, at the HTTP level, the protocol does not persist requests or responses that have been sent.

With HTTP, every new request produces a new response. The protocol itself does not retain information about previous request or response messages. HTTP was deliberately kept this simple so that it can process large numbers of transactions quickly and scale well. However, as the Web has evolved, statelessness has made more and more business scenarios awkward to handle. For example, a user who logs into a shopping site needs to stay logged in after moving to other pages on the site. To know who sent each request, the site needs to save the user’s state. Although HTTP/1.1 is a stateless protocol, cookie technology was introduced to provide the desired state retention. With cookies, state can be managed over HTTP communication. More on cookies later.
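As a rough sketch of how a cookie can carry state across otherwise independent requests, the example below uses Python’s standard http.cookies module. The session store, the cookie name sessionid, and both function names are assumptions made for illustration; real sites typically use a framework’s session machinery.

```python
from http.cookies import SimpleCookie

SESSIONS = {}  # hypothetical in-memory store: session id -> user name


def login_response_headers(user, session_id):
    """Server side: remember the user and hand the session id to the client."""
    SESSIONS[session_id] = user
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id
    cookie["sessionid"]["path"] = "/"
    # The server sends this back as a Set-Cookie response header.
    return [("Set-Cookie", cookie["sessionid"].OutputString())]


def user_from_request(cookie_header):
    """Server side: recover the user from the Cookie header of a later request."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("sessionid")
    if morsel is None:
        return None
    return SESSIONS.get(morsel.value)
```

The protocol itself stays stateless: the server recognises the user only because the browser sends the cookie back with every later request.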

Connectionless

Connectionless means that each connection handles only one request. After the server has processed the client’s request and received the client’s acknowledgement, it disconnects. This saves transmission time and improves concurrency: instead of holding a permanent connection with every user, the server and client disconnect once a request has been served.

Connectionless behaviour comes in two forms. Originally, HTTP disconnected immediately after each request and response. HTTP/1.1 does not disconnect right away but waits a few seconds in case the user performs further operations. If the user sends a new request within that window, it is sent and answered over the existing connection; if no new request arrives, the connection is closed. This improves efficiency by reducing the number of connections established in a short period, because establishing a connection is time-consuming. The default wait currently appears to be about 3 seconds, but it can be adjusted in the back-end code, ideally to an optimal value derived from statistics on how the site’s own users behave.
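Connection reuse under HTTP/1.1 can be observed with the standard library. In the sketch below, a hypothetical local handler reports the client’s source port; if two requests made through one http.client.HTTPConnection report the same port, they travelled over the same TCP connection. The names KeepAliveHandler and ports_seen_by_server are illustrative only, and the idle timeout discussed above is not modelled.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class KeepAliveHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets http.server keep the connection open

    def do_GET(self):
        # Report the client's source port so connection reuse is visible.
        body = str(self.client_address[1]).encode("ascii")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def ports_seen_by_server(request_count=2):
    server = HTTPServer(("127.0.0.1", 0), KeepAliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    try:
        ports = []
        for _ in range(request_count):
            conn.request("GET", "/")
            ports.append(conn.getresponse().read())
        return ports
    finally:
        conn.close()  # close the client side first so the handler can finish
        server.shutdown()
        server.server_close()
```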

HTTP request methods

The HTTP/1.1 protocol defines eight methods (also called “actions”) to manipulate a given resource in different ways:

GET

Requests a representation of the specified resource. GET should only be used to read data and should not be used for operations that have side effects, for example in Web applications. One reason is that GET requests may be issued arbitrarily by web crawlers and similar tools.

HEAD

Like the GET method, it makes a request to the server for a specified resource, except that the server does not return the body of the resource. The advantage is that you can retrieve information about the resource (meta information or metadata) without having to transfer the entire content.

POST

Submits data to a specified resource, asking the server to process it (for example, submitting a form or uploading a file). The data is included in the request body. This request may create a new resource, modify an existing resource, or both.

PUT

Uploads its latest content to the specified resource location.

DELETE

Requests the server to remove the resource identified by the Request-URI.

TRACE

Echoes back the request as received by the server, mainly for testing or diagnosis.

OPTIONS

This method causes the server to return all HTTP request methods supported by the resource. Use ‘*’ instead of the resource name to send an OPTIONS request to the Web server to test whether the server functions properly.

CONNECT

Reserved in HTTP/1.1 for proxy servers that can switch the connection to a tunnel. It is typically used to reach SSL-encrypted servers through an unencrypted HTTP proxy server.

Notes:

  1. Method names are case sensitive. When the requested resource does not support the request method, the server returns status code 405 (Method Not Allowed). When the server does not recognize or support the request method, the server returns status code 501 (Not Implemented).
  2. An HTTP server should implement at least the GET and HEAD methods; the rest are optional. All implemented methods should match the semantics described above. In addition to these methods, a particular HTTP server can support extension methods. For example, PATCH (specified by RFC 5789) applies partial modifications to a resource.
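The difference between 405 and 501 in the first note can be sketched as a small decision function. The set KNOWN_METHODS and the function status_for are hypothetical; a real server would also send an Allow header with a 405 response.

```python
# Methods this hypothetical server recognises at all (case-sensitive names).
KNOWN_METHODS = {"GET", "HEAD", "POST", "PUT", "DELETE", "TRACE", "OPTIONS", "CONNECT"}


def status_for(method, allowed_for_resource):
    """Pick a status code for a request method on one resource."""
    if method not in KNOWN_METHODS:
        return 501  # Not Implemented: the server does not know the method
    if method not in allowed_for_resource:
        return 405  # Method Not Allowed: known method, but not for this resource
    return 200
```

Because method names are case sensitive, status_for("get", {"GET", "HEAD"}) falls into the 501 branch in this sketch.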

Request methods: GET and POST (as seen with an HTML form)

  • Data submitted by GET is placed after the URL, in the request line. A “?” separates the URL from the data, and parameters are joined with “&”, such as EditBook?name=test1&id=123456. The POST method places the submitted data in the request body of the HTTP message.
  • Data submitted by GET is limited in size (because browsers limit the length of URLs), while the protocol places no such limit on data submitted by POST.
  • The server reads the submitted data differently for GET and POST requests: from the URL’s query string for GET, and from the request body for POST.
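To make the difference concrete, the following sketch builds (but does not send) one GET and one POST request with the standard urllib module. The host example.com and the helper names are assumptions; the parameters are the ones from the EditBook example.

```python
from urllib.parse import urlencode
from urllib.request import Request

PARAMS = {"name": "test1", "id": "123456"}


def build_get(base_url, params):
    # GET: the encoded parameters travel in the URL, after "?".
    return Request(f"{base_url}?{urlencode(params)}")


def build_post(base_url, params):
    # POST: the same encoded parameters travel in the request body.
    return Request(base_url, data=urlencode(params).encode("ascii"))


get_req = build_get("http://example.com/EditBook", PARAMS)
post_req = build_post("http://example.com/EditBook", PARAMS)
```

Here get_req.full_url carries the query string and get_req.data is None, while post_req.full_url is unchanged and post_req.data holds the encoded form fields.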

HTTP status codes

The first line of every HTTP response is the status line, which consists of the HTTP version number, a three-digit status code, and a phrase describing the status, separated by spaces.

The first number in the status code represents the type of the current response:

  • 1XX informational – The request has been received by the server and processing continues
  • 2XX success – The request was successfully received, understood, and accepted by the server
  • 3XX redirection – Further action is required to complete the request
  • 4XX client error – The request contains a syntax error or cannot be fulfilled
  • 5XX server error – The server failed while processing a valid request

Although RFC 2616 recommends phrases for describing the status, such as “200 OK” and “404 Not Found”, Web developers can still choose their own phrases, for example to display localized status descriptions or custom information.
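The class of a status code follows directly from its first digit, which the short sketch below illustrates. The mapping STATUS_CLASSES and the function describe are illustrative names; the recommended phrases come from Python’s standard http.HTTPStatus enumeration.

```python
from http import HTTPStatus

STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return the response class and the standard phrase for a status code."""
    category = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = ""  # code is not registered in HTTPStatus
    return category, phrase
```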

URL

An HTTP Uniform Resource Locator (URL) combines the basic elements needed to retrieve information from the Internet into a single address:

  • The transport protocol.
  • The hierarchical URL marker (“//”, fixed).
  • Credential information needed to access the resource (can be omitted).
  • The server (usually a domain name, sometimes an IP address).
  • The port number (a number; can be omitted when it is the HTTP default, :80).
  • The path (directory names in the path are separated by slash characters).
  • The query (form parameters in GET mode; it starts with “?”, parameters are separated by “&”, and “=” separates each parameter name from its value; usually URL-encoded as UTF-8 to avoid character conflicts).
  • The fragment (starts with the “#” character).

Take http://www.luffycity.com:80/news/index.html?id=250&page=1 as an example:

http is the protocol; www.luffycity.com is the server; 80 is the default network port number on the server and is normally not displayed; /news/index.html is the path (the URI that points to the corresponding resource); ?id=250&page=1 is the query. Most web browsers do not require users to type the “http://” part of an address, because most web content is served over HTTP. Similarly, 80 is the usual port number for HTTP, so it generally does not need to be specified. Users therefore usually type only part of the uniform resource locator (www.luffycity.com:80/news/index….
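The same decomposition can be reproduced with Python’s standard urllib.parse module. This is a minimal sketch; split_url is a hypothetical helper, and falling back to port 80 assumes a plain http URL.

```python
from urllib.parse import parse_qs, urlsplit


def split_url(url):
    """Break a URL into the elements listed above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or 80,  # assumption: default port for plain http
        "path": parts.path,
        "query": parse_qs(parts.query),
        "fragment": parts.fragment,
    }


example = split_url("http://www.luffycity.com:80/news/index.html?id=250&page=1")
```

Note that urlsplit needs the “http://” prefix to recognise the host; without a scheme, the whole string would be treated differently.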

Because HTTP allows the server to redirect the browser to another web address, many servers let users omit parts of a web address, such as “www”. Technically, the shortened address is a different web address. The browser itself cannot determine whether the new address is valid, so the server must perform the redirection.

HTTP Request Format (Request Protocol)

The URL part contains, for example, /index/index2?a=1&b=2; the path and the parameters both appear here.

The request headers follow the request line. For example, Content-Length gives the length of the data in the request body. The other request headers are also key/value pairs; we will cover them gradually, and a rough understanding is enough for now. The one you need to remember is User-Agent, which tells the server what kind of client is sending the request.

Taking JD.com (Jingdong) as an example, take a look at the User-Agent:

Consider a crawler example: crawling JD.com works without any problem, but when crawling Chouti (“Drawer”) you must send a User-Agent, because Chouti inspects the User-Agent to judge whether a request is a normal one. This is a kind of anti-crawling mechanism.

Open our demo.html file in your browser to see what the page looks like.

The point of the content above is to show that request headers exist and that some of them carry meaning. We can also define our own request headers: in the requests module, they go into the headers={} dictionary.
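As a small illustration of setting request headers from code, the sketch below attaches a User-Agent and a custom header to a request object without sending it. It uses the standard urllib.request module so that it runs without extra packages; with the requests module, the same dictionary would be passed as the headers argument. The URL, the User-Agent value, and the X-Demo header are made-up examples.

```python
from urllib.request import Request

# Hypothetical header values, chosen only for illustration.
CUSTOM_HEADERS = {
    "User-Agent": "Mozilla/5.0 (example-crawler)",
    "X-Demo": "1",
}


def build_request_with_headers(url, headers):
    """Create a request object that will carry the given headers when sent."""
    return Request(url, headers=headers)


req = build_request_with_headers("http://example.com/", CUSTOM_HEADERS)
# urllib stores header names capitalized as "User-agent" and "X-demo".
user_agent = req.get_header("User-agent")
```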

HTTP Response format (Response Protocol)