(Note: because all domain names must now be registered, and .tech domain names cannot be registered, nladuo.tech is replaced by nladuo.cn in what follows.)

Talk about HTTP requests: GET and POST

In the previous section, we called the requests.get method to download an HTML page without knowing how it worked. In this section, we’ll talk about what HTTP requests are and what their features are.

In HTTP requests, there are two main methods: GET and POST. The main differences are as follows:

  • GET information is carried in the URL, for example “?categoryId=1”.
  • POST information is carried in the form (the request body). For example, when we submit a login user name and password, this private information does not appear in the URL. Likewise, when we upload a large file, such as a 1 GB video, the video data is not put in the URL.
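The difference can be seen in a minimal sketch built with Python's standard urllib. The example.com URLs and the alice/secret values are placeholders chosen for illustration, not part of the test site, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters are encoded into the URL itself.
get_request = Request("http://example.com/list?" + urlencode({"categoryId": 1}))

# POST: the parameters travel in the request body (the form), not in the URL.
form = urlencode({"uname": "alice", "passwd": "secret"}).encode("ascii")
post_request = Request("http://example.com/login", data=form)

# Only the prepared requests are inspected here; no connection is opened.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Note that the POST URL contains no user name or password; they exist only in the body.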

In addition, HTTP is a short-connection protocol built on TCP. Why? Because an HTTP request basically proceeds as follows:

First, our browser opens a TCP connection to a port on the remote web server. It then sends an instruction, for example to fetch the contents of the root path (URL ‘/’). The server returns the result that the client (our browser) asked for (HTML text, images, and so on). Once the client has received the response, it closes the connection, and the web server closes its connection with the client. This completes one HTTP request.

So how do browsers and Web servers send instructions?

View HTTP requests in Chrome

Using nladuo.cn/test.html as a test page, open Chrome’s developer tools, switch to the Network tab, and then enter the link.

The page makes only one request, for its HTML, which returns “It Works”. Next, click the test.html item in the request list to see the Request Headers sent to the server and the Response Headers returned to the client.

We can click View Source next to Request Headers to see the raw request header.

Here you can see:

  • The request is for “/test.html” on nladuo.cn, using version 1.1 of the HTTP protocol
  • The Host of the requested domain name is nladuo.cn
  • And so on…

Similarly, look at the Response Headers.

You can see:

  • Version 1.1 of the HTTP protocol is used and the status code is 200 OK
  • The date and time of the response
  • The server is Apache 2.4.7 (Ubuntu)
  • And so on…
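To make the structure of these headers concrete, here is a small sketch that splits a raw response head into its status line and fields. The parse_response_head helper is hypothetical, and the sample text is only modelled on the fields listed above (the Content-Type line is an assumption); a real response carries more headers.

```python
def parse_response_head(raw_head):
    """Split a raw HTTP response head into (version, code, reason, headers)."""
    lines = raw_head.split("\r\n")
    # The first line is the status line, e.g. "HTTP/1.1 200 OK".
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line:
            continue  # skip the empty string left by the trailing line break
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers


raw_head = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Apache/2.4.7 (Ubuntu)\r\n"
    "Content-Type: text/html\r\n"
)
version, code, reason, headers = parse_response_head(raw_head)
print(version, code, reason)
print(headers["server"])
```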

Next, let’s use Telnet, the TCP client that comes with the computer, to simulate sending a GET request and see what actually happens during one.

Use the Telnet client on your computer

For Mac or Linux users, telnet can be run from the terminal without any configuration.

For Windows users, see the Enabling Telnet section here.

Use Telnet to simulate GET requests

For websites, port 80 is the default HTTP port. As the General section of the request in the Network tab shows, we did not specify a port, yet the site was reached through port 80. Of course, you can also type nladuo.cn:80/test.html to access it, but that is generally unnecessary.

So we will use TCP to connect to port 80 of nladuo.cn with the following command.

telnet nladuo.cn 80

After entering the command, you can see that nladuo.cn is resolved to the IP 123.206.86.230 (the IP may instead appear as 191.101.13.124, because a different server is used) and a TCP connection is established. At this point Mac and Linux users can type commands directly. On Windows, press Ctrl + ‘]’ to enter command mode, then press Enter to return to the session with visible input, and then type the commands.

As seen in Chrome’s Network tab, we first send a GET request for the /test.html resource using version 1.1 of the HTTP protocol, and then tell the server that the Host we are accessing is nladuo.cn. We could go on to tell the server our User-Agent, our Accept-Encoding, and so on, but these are not essential, so we type only the first two lines.

GET /test.html HTTP/1.1
Host: nladuo.cn

Press Enter after each line, and press Enter twice at the end. After the two returns, wait a moment to see what the server sends back.

Here you can see 200 OK, the time, the server information, and so on, just as in Chrome earlier.

After the blank line (two carriage returns) that follows the response header, you can see the returned HTML: “It Works”.

At this point the request is finished, and after a while you can see that the remote server closes the TCP connection. One HTTP request is now complete.
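The Telnet steps above can also be sketched with Python's socket module. So that it runs without network access, this sketch talks to a tiny local stand-in server that answers “It Works”; the ItWorksHandler, raw_get, and fetch_from_local_demo_server names are hypothetical, and the extra Connection: close header (not typed in the Telnet session) is an assumption added so the server closes the connection immediately instead of after a timeout.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ItWorksHandler(BaseHTTPRequestHandler):
    """Local stand-in for the test page: every GET returns 'It Works'."""

    def do_GET(self):
        body = b"It Works"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def raw_get(host, port, path):
    """Send a hand-written GET request over TCP and return the raw reply."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"  # the empty line (second carriage return) ends the header
    )
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


def fetch_from_local_demo_server():
    """Start the stand-in server, fetch /test.html, return (header lines, body)."""
    server = HTTPServer(("127.0.0.1", 0), ItWorksHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        raw = raw_get("127.0.0.1", server.server_port, "/test.html")
    finally:
        server.shutdown()
        server.server_close()
    head, _, body = raw.partition(b"\r\n\r\n")
    return head.decode("iso-8859-1").split("\r\n"), body


if __name__ == "__main__":
    header_lines, body = fetch_from_local_demo_server()
    print("\n".join(header_lines))
    print(body)
```

To try it against the real page instead, raw_get("nladuo.cn", 80, "/test.html") would send the same two lines typed into Telnet, assuming the site is still reachable.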

Simulating login

What happens during login?

From the example above of simulating an HTTP request over TCP, we know that the client sends a request header, the server returns a response header plus HTML, and the TCP connection is closed. Yet in daily use, after we log in to a web page, the browser still remembers our login after a refresh. How does that work?

Here is a simple login page for studying what happens during login. The login link is nladuo.cn/crawler_les… and the address of the privacy page shown after login is nladuo.cn/crawler_les…

First, without logging in, visit the privacy page. Be sure to open the Network tab first and only then enter the link nladuo.cn/crawler_les… .

You can see that we are not authorized to view the page and that the Response Headers contain a Set-Cookie field. Of course, if you did not follow that order, and instead entered the URL, let the page load, and only then opened the Network tab and refreshed, you will see the following instead.

The Set-Cookie in the Response Headers is gone, and the Request Headers now contain an extra Cookie field. The two values are the same: “PHPSESSID=9m8vgq9699fun79t3ks6ljrdh7”.

No matter how many times the page is refreshed, no new Set-Cookie appears in the Response Headers. After the first request, the Request Headers always carry a Cookie. For example, if we look at the login page, nladuo.cn/crawler_les… , the Cookie is sent there as well.

We can keep the question in mind for now, but we already know this much: Set-Cookie is sent only once, by the server; after that, the browser saves the value from Set-Cookie and sends it back in the Cookie field every time it accesses content on that domain.
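This hand-off from Set-Cookie to Cookie can be sketched with Python's standard http.cookies module. The session ID is the one shown above; the “path=/” attribute is an assumption about what the server appends, added only to show that attributes are not echoed back.

```python
from http.cookies import SimpleCookie

# What the server sends once, in the response header (attribute assumed).
set_cookie_value = "PHPSESSID=9m8vgq9699fun79t3ks6ljrdh7; path=/"

# The client parses and stores it...
jar = SimpleCookie()
jar.load(set_cookie_value)

# ...and on later requests sends back only the name=value pairs.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in jar.items()
)
print("Cookie:", cookie_header)
```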

Next, let’s try to log in. The user name and password are both nladuo by default, so we simply type nladuo into both input boxes. After clicking login, we can see a POST request whose response says “login successful”.

Compared with GET, the POST request header has two extra fields, Content-Type and Content-Length, because this time the data is placed in the Form Data rather than in the URL. Content-Length is the number of bytes sent, and Content-Type is the type of the data sent. Here the form fields are converted into key-value pairs: uname: nladuo and passwd: nladuo become uname=nladuo&passwd=nladuo, which is exactly 26 characters.

Of course, we could also use Telnet to simulate the POST request. Readers can try this themselves; we won’t go into it here.

Looking at the Cookie below, you can see that it has not changed: it is still “PHPSESSID=9m8vgq9699fun79t3ks6ljrdh7”.
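The 26-character figure can be checked with a short sketch using urllib.parse.urlencode. The Content-Type value shown is the usual one for HTML form submissions and is assumed here, since the text does not quote it.

```python
from urllib.parse import urlencode

form = {"uname": "nladuo", "passwd": "nladuo"}

# Form fields are joined as key=value pairs separated by "&".
body = urlencode(form)

# Content-Length counts the bytes of the encoded body.
content_length = len(body.encode("utf-8"))

headers = {
    "Content-Type": "application/x-www-form-urlencoded",  # assumed, typical for forms
    "Content-Length": str(content_length),
}
print(body)
print(headers)
```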

Cookie and Session

This question leads us to the concepts of Cookie and Session.

The Cookie we saw above is a client-side technology, while the Session is a server-side technology. What actually happens is that when a session starts, the server gives the client an ID, as with the Set-Cookie above, which marks the beginning of the session; when the browser is closed, the session ends.

In a session, the user’s state is stored on the server, while the client keeps only a session ID. So when we request the nladuo.cn/crawler_les… page without logging in, the server looks up the entry with ID 9m8vgq9699fun79t3ks6ljrdh7 in the PHPSESSID table of its cache database, checks whether the user is logged in, and returns a different page depending on the result.

For example, before we log in, the is_login field for PHPSESSID 9m8vgq9699fun79t3ks6ljrdh7 on the cache server is NULL, so the server finds is_login empty and does not show the private page. When we log in with the correct user name and password, the server sets the is_login field for 9m8vgq9699fun79t3ks6ljrdh7 to True, so the next time the user requests the privacy page, the correct result is returned.
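The server-side bookkeeping described above might look roughly like the following sketch. It is not the test site's actual PHP code: SessionStore and its methods are hypothetical, and a plain dictionary stands in for the cache database.

```python
import secrets


class SessionStore:
    """In-memory stand-in for the server's session cache (hypothetical)."""

    def __init__(self):
        self._sessions = {}

    def start_session(self):
        """Create a new session; the ID is what Set-Cookie would carry."""
        session_id = secrets.token_hex(13)
        self._sessions[session_id] = {"is_login": None}  # not logged in yet
        return session_id

    def login(self, session_id, uname, passwd):
        """Mark the session as logged in if the credentials are correct."""
        if session_id in self._sessions and (uname, passwd) == ("nladuo", "nladuo"):
            self._sessions[session_id]["is_login"] = True
            return True
        return False

    def private_page(self, session_id):
        """Return a different page depending on the session's login state."""
        session = self._sessions.get(session_id)
        if session and session["is_login"]:
            return "private content"
        return "not authorized"


store = SessionStore()
sid = store.start_session()          # first visit: the server issues an ID
before = store.private_page(sid)     # is_login is still empty
store.login(sid, "nladuo", "nladuo")
after = store.private_page(sid)      # same ID, state changed on the server
print(before, "->", after)
```

The client's cookie never changes in this flow; only the state the server keeps under that ID does.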

Use cookies to simulate login

Now that we know how login works, let’s simulate these steps in code: 1. log in first, 2. save the Cookie, 3. request the privacy page with the Cookie.

import requests

# 1. Log in first
resp1 = requests.post("http://nladuo.cn/crawler_lesson2/do_login.php",
  data={
    "uname": "nladuo", "passwd": "nladuo"
})

# 2. Save the Cookie returned by the server (keep only the name=value part)
print("Set-Cookie:", resp1.headers["set-cookie"])
cookie = resp1.headers["set-cookie"].split(";")[0]

# 3. Request the privacy page again, this time with the Cookie
resp = requests.get("http://nladuo.cn/crawler_lesson2/private.php",
  headers={
    "Cookie": cookie  # Alternatively, log in with a browser or a Telnet POST request and paste the cookie here
})
print(resp.content)

Run the code and you can see that the login is successful and the privacy page is returned.

Simulate login using requests.Session

Managing cookies by hand as above always feels a bit cumbersome, so the Requests library gives us a more convenient object: requests.Session, which records the cookies of our session just as a browser does.

import requests

# 1. Create a Session
session = requests.Session()
# 2. Log in
session.post("http://nladuo.cn/crawler_lesson2/do_login.php",
             data={"uname": "nladuo", "passwd": "nladuo"})
# 3. Visit the privacy page
resp = session.get("http://nladuo.cn/crawler_lesson2/private.php")
print(resp.content)

Now the flow in our code is just like that of a person using a browser: log in, then request the privacy page.

When you run the code, you see the same correct results.