A complete HTTP request process

What happens between typing www.linux178.com into the browser’s address bar and pressing Enter, and seeing the page?

The process below reflects my personal understanding only:

Domain name resolution -> TCP three-way handshake -> the browser sends the HTTP request once the TCP connection is established -> the server responds to the HTTP request and the browser gets the HTML code -> the browser parses the HTML code and requests the resources it references (JS, CSS, images, etc.) -> the browser renders the page for the user

For details about HTTP, see the following:

A ramble on the HTTP protocol: http://kb.cnblogs.com/page/140611/

An overview of the HTTP protocol: http://www.cnblogs.com/vamei/archive/2013/05/11/3069788.html

Understanding all aspects of HTTP headers: http://kb.cnblogs.com/page/55442/

The following is an analysis of the above process. Let’s take Chrome as an example:

1. Domain name resolution

First, Chrome resolves the IP address for the domain name www.linux178.com. How does it do that?

(1) Chrome first searches the browser’s own DNS cache (entries are kept only briefly, about 1 minute, and the cache holds at most 1000 entries) for an entry for www.linux178.com. If one exists and has not expired, resolution ends here.

Note: How do we view Chrome’s own cache? Open chrome://net-internals/#dns

(2) If the browser’s own cache has no matching entry, Chrome searches the operating system’s DNS cache. If an entry is found there and has not expired, the search stops and resolution ends.

Note: How do we view the operating system’s DNS cache? On Windows, for example, run ipconfig /displaydns at the command line.

(3) If nothing is found in the Windows DNS cache, the system reads the hosts file (located in C:\Windows\System32\drivers\etc) to see whether it maps the domain name to an IP address. If it does, resolution succeeds.

(4) If no corresponding entry is found in the hosts file, the browser makes a DNS system call, which sends a resolution request to the locally configured preferred DNS server (usually provided by the telecom carrier; you can also use a public DNS server such as Google’s). The request travels over UDP to port 53 and is recursive: the carrier’s DNS server must come back with the IP address of the domain name. The carrier’s DNS server first checks its own cache. If it finds an unexpired entry, resolution succeeds.

If it finds no entry, the carrier’s DNS server performs iterative resolution on behalf of our browser. It starts with a root server (the DNS server has the IP addresses of the 13 root servers built in) and asks: what is the IP address of www.linux178.com? The root server sees that the name belongs to the top-level domain com and replies: I do not know the IP address of this name, but I know the addresses of the servers for the com domain, go ask them. The carrier’s DNS server then asks a com server the same question. The com server replies: I do not know the IP address of www.linux178.com, but I know the address of the DNS server for the linux178.com domain, go ask it. The carrier’s DNS server now has the address of the DNS server for linux178.com (usually provided by the domain registrar, such as Wanwang) and asks it: what is the IP address of www.linux178.com? That server looks the name up, finds it, and sends the result back.

The carrier’s DNS server now has the IP address for www.linux178.com and returns it to the Windows kernel, which hands it to the browser. The browser finally has the IP address corresponding to www.linux178.com and can move on to the next step.
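The iterative walk in step (4) can be sketched as a toy model. This is only an illustration under stated assumptions, not a real resolver: the server labels, the zone tables, and the address 203.0.113.10 (taken from a range reserved for documentation) are all invented here.

```python
# Toy model of iterative DNS resolution: root -> com -> linux178.com.
# All server labels and the final IP address are hypothetical placeholders.

# Each "server" either knows the answer or refers the caller to a server
# responsible for a more specific zone.
SERVERS = {
    "root": {"referrals": {"com.": "com-tld"}},
    "com-tld": {"referrals": {"linux178.com.": "linux178-ns"}},
    "linux178-ns": {"answers": {"www.linux178.com.": "203.0.113.10"}},
}


def query(server, name):
    """Ask one server; return ("answer", ip), ("referral", next) or ("nxdomain", None)."""
    data = SERVERS[server]
    if name in data.get("answers", {}):
        return "answer", data["answers"][name]
    # Follow the referral whose zone is a suffix of the queried name.
    for zone, next_server in data.get("referrals", {}).items():
        if name == zone or name.endswith("." + zone):
            return "referral", next_server
    return "nxdomain", None


def resolve_iteratively(name):
    """What the carrier's DNS server does on behalf of the client."""
    server, path = "root", []
    while True:
        path.append(server)
        kind, value = query(server, name)
        if kind == "answer":
            return value, path
        if kind == "nxdomain":
            return None, path
        server = value  # go ask the server we were referred to


ip, path = resolve_iteratively("www.linux178.com.")
print(ip, path)
```

The client only ever sees the final result, which is what makes its own query recursive, while the carrier’s server does the step-by-step asking.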

Note: normally the following steps are not performed.

If resolution still fails after the four steps above, the following steps are tried (on Windows):

(5) The operating system checks the NetBIOS name cache. What is the NetBIOS name cache? It stores the computer name and IP address of every computer this machine has successfully communicated with recently. When does this step succeed? When the name was successfully contacted just a few minutes ago.

(6) If step 5 does not succeed, the system queries the WINS server (the server that maps NetBIOS names to IP addresses).

(7) If step 6 does not succeed, the client searches by broadcast.

(8) If step 7 does not succeed, the client reads the LMHOSTS file (in the same directory as the hosts file).

If step 8 does not resolve the name either, resolution is declared a failure and the target computer cannot be reached. As long as any one of these eight steps succeeds, communication with the target computer can proceed.
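The "first step that succeeds wins" rule can be sketched as an ordered chain of lookups. This is a minimal sketch that models only four of the eight sources, and the table contents (including intranet.test and both IP addresses) are invented for illustration.

```python
# Sketch of "try each source in order, stop at the first hit".
# The sources and their contents are hypothetical placeholders.

def make_source(label, table):
    """Wrap a dict as a lookup source that returns an IP address or None."""
    def lookup(name):
        return table.get(name)
    lookup.label = label
    return lookup


SOURCES = [
    make_source("browser cache", {}),
    make_source("OS cache", {}),
    make_source("hosts file", {"intranet.test": "192.0.2.7"}),
    make_source("DNS server", {"www.linux178.com": "203.0.113.10"}),
]


def resolve(name, sources=SOURCES):
    """Return (ip, label of the source that answered), or (None, None)."""
    for source in sources:
        ip = source(name)
        if ip is not None:
            return ip, source.label  # later sources are never consulted
    return None, None


print(resolve("www.linux178.com"))
print(resolve("intranet.test"))
print(resolve("nope.invalid"))
```

Ordering the sources from cheapest to most expensive is what keeps the common case fast: most lookups never leave the local caches.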

See the packet capture screenshot below:

The test below was run on a Linux virtual machine. Requesting the page directly from Chrome generates a lot of unrelated traffic, so the request is made with the command wget www.linux178.com instead. Note that wget fetches only index.html; the static resources (JS, CSS, etc.) referenced by index.html are not requested.

Packet capture analysis:

Packet ①: the virtual machine broadcasts to obtain the MAC address of 192.168.100.254 (the gateway), because communication on a LAN relies on MAC addresses. It needs to talk to the gateway because our DNS server’s IP is an external address, and traffic leaving the LAN has to go through the gateway.

Packet ②: the gateway’s reply to the virtual machine’s broadcast, giving its own MAC address. The client has now found its route out.

Packet ③: the domain name resolution request sent by the wget command to the DNS server configured on the system (more precisely, wget makes a DNS system call). The name requested is www.linux178.com and the record type is AAAA, which asks for an IPv6 address.

Packet ④: the DNS server’s response to the system. IPv6 is clearly still rarely used, so there is no AAAA record.

Packet ⑤: the host name www.linux178.com.leo.com does not exist, so the result is no such name.

Packet ⑥: the request for the IPv4 address (A record) of the domain name.

The DNS server then returns the IP address of the domain name, which it obtained from its cache, to the system, and the system hands it to the wget command. This also shows that the query between the client and the local DNS server is recursive (that is, the server must give the client a result). The next step is the TCP three-way handshake.

2. Initiate a TCP three-way handshake

After obtaining the IP address for the domain name, the User-Agent (usually the browser) opens a TCP connection from a random local port (1024 < port < 65535) to the web server program (httpd, Nginx, etc.) on the server, by default on port 80 for HTTP. The connection request, encapsulated layer by layer according to the four-layer TCP/IP model, travels to the server through various routing devices (unless both hosts are on the same LAN), reaches the network card, and then enters the kernel’s TCP/IP stack, which recognizes the connection request and unpacks it layer by layer. It may also pass through the Netfilter firewall (a kernel module) before finally reaching the web server program (Nginx in this article), at which point the TCP/IP connection is established.

The diagram below:

1) The Client sends a connection request segment. ACK = 0 means the acknowledgement number is not valid, SYN = 1 marks the segment as a connection request (or connection accept) segment, and seq = x is the Client’s initial sequence number (in the capture, seq = 0, the Client’s segment number 0). The Client then enters the SYN_SENT state, meaning it is waiting for the Server’s reply.

2) After the Server receives the connection request and agrees to establish the connection, it sends a confirmation segment to the Client. Both SYN and ACK are set to 1 in the TCP header. ack = x + 1 means that everything up to x has been received correctly and that the next segment is expected to start at sequence number x + 1 (in the capture, ack = 1, that is 0 + 1, the first segment expected from the Client). seq = y is the Server’s own initial sequence number (in the capture, seq = 0, the Server’s segment number 0). The Server then enters the SYN_RCVD state, meaning it has received the Client’s connection request and is waiting for the Client’s confirmation.

3) After receiving the confirmation, the Client must confirm in turn, and may send data to the Server along with it. The ACK flag is 1, so the acknowledgement number ack = y + 1 is valid (the first segment expected from the Server), and the Client’s own sequence number is seq = x + 1 (its first segment, as opposed to segment 0). The TCP connection is now in the ESTABLISHED state and the HTTP request can be sent.
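The sequence and acknowledgement arithmetic of the three steps above can be written out as a small sketch. It is a simplified model for illustration only: the dictionary layout and the function name are made up here, and no real packets are sent.

```python
# Toy walk-through of the three-way handshake's seq/ack arithmetic.
# x and y are the initial sequence numbers chosen by client and server.

def three_way_handshake(x, y):
    """Return the three segments plus the final state of each side."""
    segments = []

    # 1) Client -> Server: SYN = 1, ACK = 0, seq = x. Client enters SYN_SENT.
    segments.append({"from": "client", "SYN": 1, "ACK": 0, "seq": x, "ack": None})
    client_state = "SYN_SENT"

    # 2) Server -> Client: SYN = 1, ACK = 1, seq = y, ack = x + 1.
    #    Server enters SYN_RCVD.
    segments.append({"from": "server", "SYN": 1, "ACK": 1, "seq": y, "ack": x + 1})
    server_state = "SYN_RCVD"

    # 3) Client -> Server: ACK = 1, seq = x + 1, ack = y + 1.
    #    Both sides end up in ESTABLISHED.
    segments.append({"from": "client", "SYN": 0, "ACK": 1, "seq": x + 1, "ack": y + 1})
    client_state = server_state = "ESTABLISHED"

    return segments, client_state, server_state


segments, client_state, server_state = three_way_handshake(x=0, y=0)
for segment in segments:
    print(segment)
print(client_state, server_state)
```

With x = 0 and y = 0 the numbers match the values quoted from the capture: the second segment carries ack = 1 and the third carries seq = 1, ack = 1.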

See the screenshots of captured packets:

Packet ⑨ corresponds to step 1 above.

Packet ⑩ corresponds to step 2 above.

Packet ⑪ corresponds to step 3 above.

Why does TCP need three handshakes?

Here’s an example:

A foreign visitor at the Palace Museum sees Xiao Ming and wants to ask him for directions, which leads to this dialogue:

Foreigner: Excuse me, can you speak English?

Xiao Ming: Yes.

Foreigner: OK, I want…

Before asking for directions, the foreigner first asked Xiao Ming whether he could speak English. Xiao Ming answered yes, and only then did the foreigner start asking for directions.

Two computers communicate by means of a protocol (today, most commonly TCP/IP). If the two computers use different protocols, they cannot communicate. The three-way handshake is thus roughly a check that the other side also follows the TCP/IP protocol; once this negotiation completes, communication can begin. This is a way of understanding it, though of course not a precise one.

Why is HTTP implemented based on TCP?

On the Internet today, virtually all transmission goes over TCP/IP, and HTTP, as an application-layer protocol in the TCP/IP model, is no exception. TCP is a reliable, connection-oriented, end-to-end protocol, so by building on TCP at the transport layer, HTTP does not need to worry about the various problems of data transmission.

3. Establish a TCP connection and send an HTTP request

After the TCP three-way handshake completes, the browser sends the HTTP request (see packet 12) using the GET method. The request URL is / and the protocol is HTTP/1.0.

Here are the details of packet 12:

The above is an HTTP request message.

What is the format of HTTP request and response messages?

Start line: for example GET / HTTP/1.0 (request method, request URL, protocol version)

Headers: name/value pairs such as User-Agent and Host

Body

Both request messages and response messages follow this format.
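To make the three-part format concrete, here is a minimal sketch that splits a raw request into start line, headers, and body. The sample request is modeled on the one described above, but the User-Agent value is a placeholder, and a real parser would need far more error handling than this.

```python
# Minimal sketch: split a raw HTTP request into its three parts.
# Not a complete parser; it assumes a well-formed message.

def parse_request(raw: bytes):
    """Return (method, target, version, headers, body)."""
    # A blank line (CRLF CRLF) separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Start line: request method, request URL, protocol version.
    method, target, version = lines[0].split(" ", 2)

    # Headers: "Name: value" pairs; names are case-insensitive.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    return method, target, version, headers, body


RAW = (
    b"GET / HTTP/1.0\r\n"
    b"User-Agent: Wget\r\n"          # placeholder value
    b"Host: www.linux178.com\r\n"
    b"\r\n"
)

print(parse_request(RAW))
```

A GET request normally has an empty body, which is why the last element of the result is an empty byte string.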

So what are the request methods in the start line?

GET: requests a resource in full

HEAD: requests only the response headers

POST: submits a form (common)

PUT: (WebDAV) uploads a file (HTML forms in browsers do not support this method)

DELETE: (WebDAV) deletes a resource

OPTIONS: returns the methods supported by the requested resource

TRACE: traces the proxies that a resource request passes through (a browser cannot issue this method)

So what are URL, URI, and URN?

URI: Uniform Resource Identifier

URL: Uniform Resource Locator

The format is as follows: scheme://[username:password@]host:port/path/to/source

For example: http://www.magedu.com/downloads/nginx-1.5.tar.gz

URN: Uniform Resource Name

URLs and URNs are both URIs.

For convenience, URL and URI are treated as the same thing for now.
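The URL format above can be taken apart with Python’s standard urllib.parse module. The first URL is the example from the text; the second, with a user name, password, and port, is a made-up address used only to show the optional parts.

```python
from urllib.parse import urlsplit

# The example URL from the text: no credentials and no explicit port.
parts = urlsplit("http://www.magedu.com/downloads/nginx-1.5.tar.gz")
print(parts.scheme, parts.hostname, parts.port, parts.path)

# A hypothetical URL that uses every optional component of the format.
full = urlsplit("http://alice:secret@example.com:8080/path/to/source")
print(full.username, full.password, full.hostname, full.port, full.path)
```

When the port is omitted, urlsplit reports it as None; the client then falls back to the default port for the scheme.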

Which protocol versions can a request use?

There are the following:

HTTP/0.9: stateless

HTTP/1.0: MIME, keep-alive, caching

HTTP/1.1: more request methods, finer cache control, and persistent connections; this version is the more commonly used one

The following is the header information of the HTTP request packet sent by Chrome

Among them

Accept: tells the server which MIME types the client accepts

Accept-Encoding: tells the server which compression encodings the client accepts

Accept-Language: tells the server which languages it may send

Connection: tells the server that the client supports keep-alive

Cookie: cookies are sent with every request so that the server can recognize the same client

Host: identifies which virtual host on the server is being requested. For example, Nginx can define many virtual hosts, and this header marks which one you want to visit.

User-Agent: the user agent, usually a web browser. Other kinds of user agent exist, such as wget, curl, and search engine spiders.

Conditional request header:

If-Modified-Since: the browser asks the server to send the resource only if it has been modified since the given time. This guarantees that when a file on the server is updated, the browser requests it again rather than using the copy in its cache.

Security request header:

Authorization: Authentication information provided by the client to the server.

What is MIME?

MIME (Multipurpose Internet Mail Extensions) is an Internet standard that extends the e-mail standard to support non-ASCII characters, binary attachments, and other message formats. It is defined in RFC 2045, RFC 2046, RFC 2047, RFC 2048, and RFC 2049. RFC 2822, which evolved from RFC 822, specifies an e-mail format that does not allow characters outside the 7-bit ASCII set in messages. As a result, messages in non-English languages and non-text content such as binaries, images, and sound could not be carried in e-mail. MIME defines a notation for representing a wide variety of data types. The MIME framework is also used by HTTP on the World Wide Web, where the standard was extended into Internet media types.

MIME types have the form major/minor, for example:

image/jpg
image/gif
text/html
video/quicktime
application/x-httpd-php
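The major/minor structure is easy to take apart in code. Below is a minimal sketch with a hypothetical helper, parse_mime, that also handles an optional parameter such as the charset that often accompanies text/html; it is not a full implementation of the media type grammar.

```python
# Minimal sketch: split a MIME type into major type, minor type, parameters.
# parse_mime is a hypothetical helper written for this illustration.

def parse_mime(value):
    """Split 'major/minor; key=value' into (major, minor, params)."""
    media_type, *param_parts = [part.strip() for part in value.split(";")]
    major, _, minor = media_type.partition("/")

    params = {}
    for part in param_parts:
        key, _, val = part.partition("=")
        if key:
            params[key.strip().lower()] = val.strip().strip('"')

    return major.lower(), minor.lower(), params


print(parse_mime("image/gif"))
print(parse_mime("text/html; charset=UTF-8"))
print(parse_mime("video/quicktime"))
```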

4. The server responds to the HTTP request, and the browser obtains the HTML code

Packet 12 is the HTTP request and packet 32 is the HTTP response.

When the web server program receives the HTTP request, it processes the request and returns the HTML file to the browser.

Packet 32 is the HTTP response returned from the server to the client (200 OK, with MIME type text/html), which shows that the client’s HTTP request was answered successfully. 200 is the status code for a successful response; other status codes include:

1xx: informational status codes

100, 101

2xx: success status codes

200: OK

3xx: redirection status codes

301: permanent redirect. The Location response header carries the new URL, and the client should use it for future requests.

302: temporary redirect. The Location response header carries the URL to fetch for now, while the original URL stays valid for future requests.

304: Not Modified. For example, when a locally cached resource file is compared with the copy on the server and turns out to be unchanged, the server returns a 304 status code, telling the browser that it does not need to fetch the resource again and can simply use its local copy.

4xx: client error status codes

404: Not Found. The resource at the requested URL does not exist.

5xx: server error status codes

500: Internal Server Error. An error occurred inside the server.

502: Bad Gateway. Occurs when the front-end proxy server cannot reach the back-end server.

504: Gateway Timeout. The proxy can reach the back-end server, but the back-end server does not answer the proxy within the allowed time.
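The grouping of status codes by their first digit can be sketched in a few lines. The helper name describe and the wording of the class labels are choices made for this example; the reason phrases come from Python’s standard http.HTTPStatus enumeration.

```python
from http import HTTPStatus

# First digit of the status code -> class of response.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return (class label, standard reason phrase) for a status code."""
    return CLASSES[code // 100], HTTPStatus(code).phrase


for code in (200, 301, 304, 404, 502, 504):
    print(code, describe(code))
```

A client that meets an unfamiliar code can still act on its class: an unknown 4xx is some kind of client error, an unknown 5xx some kind of server error.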

The following response header is displayed in Chrome:

Connection: the keep-alive feature is in use

Content-Encoding: the resource is compressed with gzip

Content-Type: the MIME type is HTML and the character set is UTF-8

Date: the date of the response

Server: the web server software in use

Transfer-Encoding: chunked. Chunked transfer encoding is a data transfer mechanism in HTTP that allows the data sent from a web server to a client application (usually a web browser) to be split into multiple parts. It is available only in HTTP version 1.1 (HTTP/1.1).

Vary: see http://blog.csdn.NET/tenfyguo/article/details/5939000

X-Pingback: see http://blog.sina.com.cn/s/blog_bb80041c0101fmfz.html

How does the server generate an HTML file after receiving an HTTP request?

Assume the server uses an Nginx + PHP (FastCGI) architecture to provide the service.

① Nginx reads the configuration file

What we type into the browser’s address bar is http://www.linux178.com (the http:// can be omitted; the browser adds it automatically). Strictly speaking, the complete form is http://www.linux178.com./ with a trailing dot after the host name (the dot stands for the root domain, and is normally left out), so the URL requested is http://www.linux178.com/. Nginx receives the GET / request from the browser, reads the headers of the HTTP request, and matches the Host header against the server_name of each of its virtual host configurations. When one matches, Nginx reads that virtual host’s configuration and finds the following directive:

root /web/echo;

This tells Nginx that all web page files live under this directory, which serves as / for the site. Visiting http://www.linux178.com/ means accessing files in this directory; for example, http://www.linux178.com/index.html refers to a file named index.html under /web/echo.

index index.html index.htm index.php;

This tells Nginx which files can serve as the home page. When we enter http://www.linux178.com/, Nginx looks for these index files in order and appends the first one it finds to the URL; if none of the three exists, it returns an error. Assuming the home page here is index.php, the URL becomes /index.php, and processing continues with the following configuration:

location ~ .*\.php(\/.*)*$ {
    root /web/echo;
    fastcgi_pass 127.0.0.1:9000;
    fastcgi_index index.php;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    include fastcgi_params;
}

This configuration says that any requested URL matching the .php pattern (the ~ turns on regular expression matching) is handed to the backend FastCGI process for handling.
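To see which URLs the location pattern catches, the same regular expression can be tried out in Python. This is only an approximation for illustration: Nginx uses the PCRE library rather than Python’s re module, although this particular pattern behaves the same in both, and the helper name handled_by_fastcgi is made up here.

```python
import re

# The pattern from the location block, copied verbatim.
LOCATION = re.compile(r".*\.php(\/.*)*$")


def handled_by_fastcgi(uri):
    """Return True if the URI would match the regex location above."""
    # A regex location is not anchored at the start, so use search().
    return LOCATION.search(uri) is not None


for uri in ("/index.php", "/index.html", "/blog/index.php/extra/path", "/style.css"):
    print(uri, handled_by_fastcgi(uri))
```

The optional group (\/.*)* is what lets a URL with extra path segments after the script name, such as /blog/index.php/extra/path, still reach FastCGI.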

② Nginx hands the PHP request to the FastCGI process

Nginx passes the /index.php URL to the backend FastCGI process and waits. When FastCGI has finished processing, it returns the generated index.html to Nginx, and Nginx returns this index.html to the browser. The browser now has the HTML code for the home page, and Nginx writes an entry to its access log.

Note 1: How does Nginx find index.php?

When Nginx determines that it needs /web/echo/index.php, it makes an I/O system call, telling the kernel that it wants to read this file. The kernel reads index.php from the hard disk into its own memory space and then copies the file into the memory space of the Nginx process. Nginx now has the file it wanted.

Note 2: How does finding files work at the file system level?

For example, nginx needs to get the /web/echo/index.php file

Each partition (for example with an ext3 file system, where the block is the smallest unit of file storage, 4096 bytes by default) contains a metadata area and a data area. Every file has an entry in the metadata area (usually 128 bytes), and each entry has a number, called the inode (index node) number. The inode holds the file type, permissions, link count, owner and group IDs, timestamps, and the numbers of the disk blocks the file occupies. (A file can occupy multiple blocks, the blocks are not necessarily contiguous, and every block is numbered.) This is shown below:

There is another important point: a directory is also a file in the general sense and occupies disk blocks; it is not a container. A newly created directory is 4096 bytes by default, meaning it needs only one disk block, although this is not fixed. So to find a directory, we likewise look up its entry in the metadata area; only once we have found its inode can we find the disk blocks the directory occupies.

So what’s in that directory, isn’t it files or something?

The directory contains a table containing the directory or file name and the corresponding inode number (temporarily called the mapping table), as shown below:

Assume that:

/ occupies blocks 1 and 2 in the data area. / is itself a directory, containing 3 directories, among them web and 111

web occupies block 5 and contains 2 directories, echo and data

echo occupies block 11 and contains 1 file, index.php

index.php occupies blocks 15 and 16; it is a file

The following figure shows its distribution in the file system

So how does the kernel find the index.php file?

The kernel receives the I/O system call from Nginx asking for the file /web/echo/index.php.

① The kernel reads the inode of / from the metadata area, gets from it the numbers of the corresponding data blocks, and finds those blocks (blocks 1 and 2) in the data area. It reads the mapping table stored there and finds the inode number that corresponds to the name web.

② The kernel reads the inode for web (No. 3) and learns that web’s block in the data area is block 5. It finds block 5, reads the mapping table in it, and learns that the inode for echo is No. 5, so it looks up inode No. 5 in the metadata area.

③ The kernel reads inode No. 5 and learns that echo’s block in the data area is block 11. It reads the mapping table in block 11 and finds the inode number for index.php.

④ The kernel reads the inode for index.php and learns that the file occupies data blocks 15 and 16. It reads those two blocks and obtains the complete contents of index.php.
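The walk from / down to index.php can be modeled with two small tables, one for the metadata area and one for the data area. This is a toy sketch, not real ext3 structures. The block numbers and the inode numbers for web and echo follow the example above; the inode numbers for / (2) and for index.php (9) are assumptions chosen for this illustration, and only the directory entries needed for the lookup are included.

```python
# Toy model of path lookup through inodes and directory blocks.
ROOT_INODE = 2    # assumption: inode number used for / in this sketch
INDEX_INODE = 9   # assumption: hypothetical inode number for index.php

# Metadata area: inode number -> numbers of the data blocks it occupies.
INODES = {ROOT_INODE: [1, 2], 3: [5], 5: [11], INDEX_INODE: [15, 16]}

# Data area: block number -> directory mapping table (name -> inode number).
DIR_BLOCKS = {
    1: {"web": 3},
    2: {},
    5: {"echo": 5},
    11: {"index.php": INDEX_INODE},
}


def lookup(path):
    """Walk the path one name at a time; return (inode, blocks, trace)."""
    inode = ROOT_INODE
    trace = [inode]
    for name in [part for part in path.split("/") if part]:
        # Search every block of the current directory for the name.
        for block in INODES[inode]:
            table = DIR_BLOCKS.get(block, {})
            if name in table:
                inode = table[name]
                break
        else:
            raise FileNotFoundError(path)
        trace.append(inode)
    return inode, INODES[inode], trace


print(lookup("/web/echo/index.php"))
```

Each path component costs one inode read plus one or more block reads, which is why deep paths are more expensive to resolve and why kernels cache directory entries.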

5. The browser parses the HTML code and requests resources in the HTML code

When the browser has the index.html file, it starts parsing the HTML code in it. Whenever it meets a static resource such as JS, CSS, or an image, it requests it from the server (downloads run in multiple threads, and the number of threads differs from browser to browser). This is where keep-alive comes in: over one established connection, multiple resources can be requested. Resources are requested in the order in which they appear in the code, but because the resources differ in size and the browser requests them on multiple threads, the order shown in the figure below is not necessarily the order in the code.

When the browser requests a static resource it already holds in its cache, it sends the server an HTTP request asking whether the resource has been modified since the last modification time it knows of. If the server returns a 304 status code (telling the browser that nothing has changed), the browser reads the resource straight from its local cache file.
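The server side of this check can be sketched as a comparison between the resource’s modification time and the date the browser sends in If-Modified-Since. This is a simplified illustration, not how any particular server implements it: the function name respond and the sample timestamps are invented, and real servers also consider other validators.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, if_modified_since=None):
    """Return the status code for a GET, given the resource's mtime."""
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= since:
            return 304  # unchanged: the browser may use its cached copy
    return 200  # no validator sent, or the resource changed: send it in full


# Hypothetical modification time of a static file on the server.
mtime = datetime(2014, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# The date string a browser would echo back from an earlier response.
header = format_datetime(mtime, usegmt=True)

print(respond(mtime))           # first visit, nothing cached
print(respond(mtime, header))   # revisit, file unchanged
```

A 304 response carries no body, so the revisit costs only a round trip of headers instead of a full download.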

For details on how browsers work, see: http://kb.cnblogs.com/page/129756/

6. The browser renders the page to the user

Finally, the browser uses its own internal mechanisms to render the requested static resources together with the HTML code, and presents the result to the user.

This completes a complete HTTP transaction.