When browsing the web, we often run into pages that can only be accessed after logging in. Once logged in, we can visit the site many times without logging in again, but sometimes we have to log in again after a while. Other websites log us in automatically as soon as the browser opens and stay that way for a long time. Why is this? The answer involves sessions and Cookies, which this section explains.

1. Static and dynamic web pages

Before we begin, we need to understand the concepts of static and dynamic web pages. Here is the same sample code as before:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
        <title>This is a Demo</title>
    </head>
    <body>
        <div id="container">
            <div class="wrapper">
                <h2 class="title">Hello World</h2>
                <p class="text">Hello, this is a paragraph.</p>
            </div>
        </div>
    </body>
</html>

This is basic HTML code. We save it as an .html file and put it on a host with a fixed public IP address that runs a web server such as Apache or Nginx. The host then acts as a server, and other people can visit it and see this page. This is the simplest possible website.

The content of this page is written entirely in HTML: the text, images, and everything else are specified by the HTML code. Such a page is called a static page. It loads quickly and is easy to write, but it has serious drawbacks: it is hard to maintain and cannot display content flexibly based on the URL. For example, if we pass a name parameter in the URL of this page, there is no way to show it in the page.

Dynamic web pages emerged to solve this. They can parse changes in URL parameters, query a database, and present different content dynamically, which makes them very flexible. Most sites we encounter today are dynamic: they are no longer plain HTML but are written in JSP, PHP, Python, and so on, and they are far more powerful and feature-rich than static pages.
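As a rough illustration of the difference, the sketch below builds the page from the URL instead of serving a fixed file. The function name render_page, the template, and the example.com URL are assumptions made for this example, not part of any particular framework.

```python
from html import escape
from urllib.parse import parse_qs, urlparse

# Hypothetical page template; only the greeting changes per request.
TEMPLATE = """<!DOCTYPE html>
<html>
    <body>
        <h2 class="title">Hello {name}</h2>
    </body>
</html>"""


def render_page(url: str) -> str:
    """Build the page for `url`, reading the optional `name` query parameter."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each key to a list of values; fall back to "World".
    name = query.get("name", ["World"])[0]
    # Escape the value so user input cannot inject markup into the page.
    return TEMPLATE.format(name=escape(name))


print(render_page("http://example.com/index.html?name=Alice"))
```

A static file would always show "Hello World"; here the same URL path yields different content depending on the name parameter.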

Dynamic websites can also implement user login and registration. Going back to the question at the beginning, many pages can only be viewed after logging in. Logically, after we log in with a username and password, we must receive something like a credential that keeps us logged in and lets us access the pages that require login.

So what exactly is this mysterious credential? It’s a combination of sessions and Cookies, so let’s take a look.

2. Stateless HTTP

Before we get to sessions and Cookies, we need to understand a feature of HTTP called statelessness.

HTTP is stateless, meaning the protocol has no memory of previous transactions: the server does not know the state of the client. When we send a request, the server parses it and returns the corresponding response. The server handles each such exchange completely independently and keeps no record of the state before or after it. This means that if later processing needs earlier information, that information must be retransmitted, so we would have to resend earlier requests just to get the follow-up response. That is obviously undesirable. Retransmitting all previous requests merely to maintain state would waste resources, especially for pages that require login.

This is where two techniques for maintaining HTTP state come in: sessions and Cookies. Sessions live on the server side, that is, on the website's server, and store the user's session information. Cookies live on the client side, that is, in the browser. The browser automatically attaches the Cookies to the next request it sends to the site. The server reads the Cookies to identify the user, determines whether the user is logged in, and returns the corresponding response.

We can think of Cookies as holding the login credential. As long as the next request carries the Cookies, we do not need to re-enter the username, password, and other information to log in again.

Therefore, when a crawler has to deal with pages that require login, we usually put the Cookies obtained after a successful login directly into the request headers, without simulating the login again.
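A minimal sketch of this idea with the standard library is shown below. The Cookie string and the URL are placeholders; in practice the Cookie value would be copied from the browser's developer tools after logging in.

```python
from urllib.request import Request

# Placeholder values: copy the real Cookie string from the browser after
# logging in. The URL is hypothetical as well.
LOGIN_COOKIES = "sessionid=abc123; user=demo"
URL = "https://example.com/profile"


def build_logged_in_request(url: str, cookies: str) -> Request:
    """Return a request that carries the saved Cookies in its headers."""
    return Request(url, headers={"Cookie": cookies, "User-Agent": "Mozilla/5.0"})


request = build_logged_in_request(URL, LOGIN_COOKIES)
print(request.get_header("Cookie"))
# To actually send it: urllib.request.urlopen(request)
```

The request is only constructed here, not sent, so the example runs without network access.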

Ok, now that we understand the concepts of sessions and Cookies, let’s take a closer look at how they work.

(1) Sessions

A session, in its original meaning, is a series of actions or messages with a beginning and an end. For example, when making a phone call, the sequence from picking up the phone and dialing to hanging up can be called a session.

On the Web, session objects are used to store properties and configuration information needed for a particular user session. This way, when a user jumps between Web pages of an application, variables stored in the session object are not lost, but persist throughout the user session. When a user requests a Web page from an application, the Web server automatically creates a session object if the user does not already have a session. When a session expires or is abandoned, the server terminates the session.
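The behavior described above can be sketched with an in-memory dictionary. This is only an illustration under simplifying assumptions: the names SESSIONS and get_or_create_session are made up for the example, and a real server would usually keep sessions in a database or cache.

```python
import secrets
from typing import Dict, Optional, Tuple

# In-memory session store: session ID -> dict of per-user variables.
SESSIONS: Dict[str, dict] = {}


def get_or_create_session(session_id: Optional[str]) -> Tuple[str, dict]:
    """Return the existing session, or create one if the ID is unknown."""
    if session_id is None or session_id not in SESSIONS:
        session_id = secrets.token_hex(16)  # random, hard-to-guess ID
        SESSIONS[session_id] = {}
    return session_id, SESSIONS[session_id]


# First request: no session yet, so one is created.
sid, session = get_or_create_session(None)
session["username"] = "demo"

# A later request presenting the same ID sees the stored variables.
_, same_session = get_or_create_session(sid)
print(same_session["username"])
```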

(2) Cookies

Cookies refer to data stored on a user’s local terminal by some websites for identification and session tracking.

Session maintenance

So how do we use Cookies to maintain state? When the client first requests the server, the server returns a response whose headers contain a Set-Cookie field, which is used to mark the user, and the browser saves the Cookies. The next time the browser requests the site, it puts these Cookies in the request headers and submits them to the server. The Cookies carry the session ID, so the server can look up the corresponding session and then check that session to determine the user's state.

When we successfully log in to a website, the server tells the client which Cookies to set. On later visits, the client sends these Cookies to the server, and the server finds the corresponding session and checks it. If the variables in the session that record the login state are valid, the user is logged in, and the server returns the content that is only visible after login, which the browser then renders.

Conversely, if the Cookies sent to the server are invalid or the session has expired, we can no longer access the page. We may receive an error response or be redirected to the login page.

Therefore, Cookies and sessions need to work together, one on the client side and the other on the server side, to achieve login session control.
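The whole round trip can be simulated in a few lines. This is a toy sketch, not a real web server: handle_request, the sessionid Cookie name, and the /login and /profile paths are assumptions made for the example, and the password check is skipped entirely.

```python
import secrets
from http.cookies import SimpleCookie

SESSIONS = {}  # server side: session ID -> session variables


def handle_request(path, cookie_header=None):
    """Toy server: returns (body, set_cookie_header_or_None)."""
    cookies = SimpleCookie(cookie_header or "")
    sid = cookies["sessionid"].value if "sessionid" in cookies else None
    session = SESSIONS.get(sid)

    if path == "/login":
        # Pretend the username and password were checked successfully.
        sid = secrets.token_hex(16)
        SESSIONS[sid] = {"logged_in": True}
        return "welcome", f"sessionid={sid}; Path=/; HttpOnly"

    if session and session.get("logged_in"):
        return "members-only content", None
    return "please log in", None


# Client side: a bare-bones "browser" that stores and resends Cookies.
body, set_cookie = handle_request("/profile")
print(body)  # no Cookie yet, so the server asks us to log in

body, set_cookie = handle_request("/login")
jar = SimpleCookie(set_cookie)  # save what the server asked us to set
cookie_header = "; ".join(f"{k}={m.value}" for k, m in jar.items())

body, _ = handle_request("/profile", cookie_header)
print(body)  # the Cookie carries the session ID, so access is granted
```

Note that the server never stores the login state in the Cookie itself; the Cookie only carries the ID that points at the server-side session.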

Attribute structure

Next, let's look at what Cookies actually contain, taking Zhihu as an example. Open the Application tab in the browser's developer tools. The Storage section on the left ends with a Cookies item, as shown in Figure 2-13. These are the Cookies.

Figure 2-13 Cookies list

As you can see, there are many entries, each of which can be called a Cookie. It has the following properties.

  • Name: the name of the Cookie. Once created, the name cannot be changed.
  • Value: the value of the Cookie. If the value contains Unicode characters, it must be encoded. If the value is binary data, Base64 encoding is required.
  • Domain: the domain that can access the Cookie. For example, if it is set to .zhihu.com, all domain names ending in zhihu.com can access the Cookie.
  • Max Age: the time, in seconds, until the Cookie expires. It is often used together with Expires, and the expiry time can be calculated from it. If Max Age is positive, the Cookie expires after that many seconds. If neither Max Age nor Expires is set, the Cookie lasts only until the browser is closed and is not saved in any form. (Some server-side APIs, such as Java's Cookie.setMaxAge, use a negative value to request this behavior; in the HTTP Set-Cookie header itself, a zero or negative Max-Age expires the Cookie immediately.)
  • Path: the path under which the Cookie is used. If set to /path/, only pages under /path/ can access the Cookie. If set to /, all pages under the domain can access it.
  • Size: the size of this Cookie.
  • HTTP: the Cookie's httponly property. If this property is true, the Cookie is carried only in HTTP headers and cannot be accessed through document.cookie.
  • Secure: whether the Cookie is transmitted only over secure protocols. Secure protocols such as HTTPS and SSL encrypt data before sending it over the network. The default is false.
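To see how these properties appear on the wire, the sketch below uses Python's standard http.cookies module to build a Set-Cookie header. The Cookie name sessionid, its value, and the .example.com domain are placeholders chosen for the example.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["sessionid"] = "abc123"                  # Name and Value
cookie["sessionid"]["domain"] = ".example.com"  # Domain
cookie["sessionid"]["path"] = "/"               # Path
cookie["sessionid"]["max-age"] = 3600           # expires one hour from now
cookie["sessionid"]["httponly"] = True          # hidden from document.cookie
cookie["sessionid"]["secure"] = True            # sent over HTTPS only

# output() renders the Cookie as a "Set-Cookie: ..." response header line.
header = cookie.output()
print(header)
```

Size is not set explicitly; the browser's developer tools compute it from the name and value.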

Session cookies and persistent cookies

On the surface, a session Cookie is kept in browser memory and becomes invalid once the browser is closed, whereas a persistent Cookie is saved to the client's hard disk and can be reused later, for example to keep the user logged in for a long time.

Strictly speaking, there is no separate kind of session Cookie or persistent Cookie; the Cookie's Max Age or Expires field alone determines when it expires.

Therefore, sites that keep us logged in persistently simply set long validity periods for both the Cookie and the session. The next time we visit, the request still carries the earlier Cookie, so the login state is maintained.
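The distinction can be checked mechanically: a Cookie behaves as a session Cookie when its Set-Cookie header carries neither Max-Age nor Expires. The helper name is_session_cookie and the sample headers below are assumptions for this sketch.

```python
from http.cookies import SimpleCookie


def is_session_cookie(set_cookie_header: str, name: str) -> bool:
    """True if the named Cookie carries neither Max-Age nor Expires."""
    morsel = SimpleCookie(set_cookie_header)[name]
    # Attributes that were not present are stored as empty strings.
    return not morsel["max-age"] and not morsel["expires"]


# No expiry information: discarded when the browser closes.
print(is_session_cookie("sid=abc; Path=/", "sid"))
# Valid for 30 days: kept on disk and sent again on later visits.
print(is_session_cookie("sid=abc; Path=/; Max-Age=2592000", "sid"))
```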

3. Common misconceptions

When discussing sessions, a common misconception is that "the session disappears as soon as the browser is closed". This is wrong. Consider a membership card: unless the customer asks to cancel it, the store will not casually delete the customer's record. Sessions are the same: a session persists until the program tells the server to delete it. For example, programs typically delete the session only when we log out.

When we close the browser, however, the browser does not notify the server first, so the server never learns that the browser has been closed. The misconception arises because most session mechanisms store the session ID in a session Cookie. Closing the browser discards that Cookie, so on the next connection the original session can no longer be found. If the Cookies set by the server are saved to the hard disk, or the HTTP request headers sent by the browser are rewritten in some way to carry the original Cookies, the original session ID can still be found when the browser is reopened, and the login state is preserved.
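Python's http.cookiejar module mimics this browser behavior, which makes the effect easy to observe. In the sketch below, the Cookie names, the example.com domain, and the helper make_cookie are assumptions; saving the jar stands in for closing the browser and loading it stands in for reopening it.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, expires):
    """Build a Cookie for the hypothetical domain example.com."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="example.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False,
        expires=expires,          # None means no expiry time was set
        discard=expires is None,  # session Cookies are discarded on exit
        comment=None, comment_url=None, rest={},
    )


jar = MozillaCookieJar()
jar.set_cookie(make_cookie("session_only", "abc", expires=None))
jar.set_cookie(make_cookie("persistent", "xyz", expires=int(time.time()) + 3600))

# "Close the browser": by default save() skips Cookies marked as discard.
path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
jar.save(path)

# "Reopen the browser": only the persistent Cookie comes back.
restored = MozillaCookieJar()
restored.load(path)
print([cookie.name for cookie in restored])
```

Passing ignore_discard=True to save() and load() would keep the session Cookie as well, which corresponds to the case where the original session ID survives a browser restart.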

Precisely because closing the browser does not delete the session, the server needs to set an expiry time for each session. When the time since the client last used the session exceeds this expiry time, the server can assume the client is no longer active and delete the session to save storage space.
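One way to picture this is an idle timeout checked on every access. The sketch below is a simplified assumption of how a server might do it: the 30-minute timeout, the SESSIONS layout, and the function touch_session are all invented for the example.

```python
import time

SESSION_TIMEOUT = 30 * 60  # idle timeout in seconds; an assumed value

# session ID -> {"data": session variables, "last_used": timestamp}
SESSIONS = {}


def touch_session(session_id, now=None):
    """Return the session's data and refresh it, or None if missing or expired."""
    now = time.time() if now is None else now
    entry = SESSIONS.get(session_id)
    if entry is None:
        return None
    if now - entry["last_used"] > SESSION_TIMEOUT:
        del SESSIONS[session_id]  # idle for too long: free the storage
        return None
    entry["last_used"] = now
    return entry["data"]


SESSIONS["abc"] = {"data": {"logged_in": True}, "last_used": 1000.0}
print(touch_session("abc", now=1060.0))            # used again after 1 minute
print(touch_session("abc", now=1060.0 + 31 * 60))  # then idle for 31 minutes
```

The now parameter exists only so the example can simulate the passage of time without sleeping.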



This resource was first published on Cui Qingcai's personal blog, Jingmi (静觅): Python3 Web Crawler Development in Practice tutorial
