Business scenario: A problem was reported recently. After an iterative update of a project, some users reported that the screen went blank when they entered the project, and they had to force a refresh with Ctrl+F5 to pull the latest version. The cause was that the HTML page the user received after my update was still the old page, so it referenced the old JS files (with MD5 suffixes). Those old JS files were deleted by the update, so the requests returned 404 and the page showed a blank screen…

At first I added the following code:

<meta http-equiv="Cache-Control" content="no-cache, no-store, must-revalidate">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="0">

As expected, it had no effect. In fact, these meta tags behave differently across browsers, so this approach is not recommended.


HTTP header field structure

Only some of the fields are listed here.

1. General header field (used in both request and response messages):

  • Cache-Control: controls caching behavior

    Cache-Control: max-age=<seconds>
    Cache-Control: max-stale[=<seconds>]
    Cache-Control: min-fresh=<seconds>
    Cache-Control: no-cache (forces revalidation with the server: strong cache is prohibited, negotiated cache is available)
    Cache-Control: no-store (disables caching entirely)
    Cache-Control: no-transform
    Cache-Control: only-if-cached (obtains the resource from the cache only)
  • Date: the date and time when the message was created

2. Request header field

  • If-Modified-Since: compares against the update time of the resource

3. Response header field

  • ETag: matching information of resources

4. Entity header field

  • Expires: the date and time when the entity body expires
  • Last-Modified: the last modification date of the resource

Browser Cache classification

Mainstream browser caching currently falls into two types: strong cache and negotiated cache. The matching process is as follows:

(1) Before sending a request, the browser checks the Expires and Cache-Control headers of the cached response to determine whether the strong cache is hit. If so, it reads the resource directly from the cache and sends no request. If not, it goes to the next step.

(2) If the strong cache is not hit, the browser sends a request carrying If-Modified-Since and If-None-Match (taken from the cached Last-Modified and ETag values), and the server uses them to determine whether the negotiated cache is hit. If it is, the server returns 304 and the browser reads the resource from its cache. If not, it goes to the next step.

(3) If neither of the first two steps matches, the resource is fetched directly from the server.
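The three steps above can be sketched as a small decision function. This is only an illustration of the flow, not how any browser is actually implemented; `CachedEntry` and `decide` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedEntry:
    """A locally cached response (hypothetical structure for illustration)."""
    age_seconds: float          # time since the response was stored
    freshness_seconds: float    # lifetime derived from Cache-Control / Expires
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def decide(entry: Optional[CachedEntry]) -> str:
    """Return which of the three steps applies to the next request."""
    if entry is None:
        return "fetch"            # step 3: nothing cached, full request
    if entry.age_seconds < entry.freshness_seconds:
        return "strong-cache"     # step 1: still fresh, no request is sent
    if entry.etag or entry.last_modified:
        return "revalidate"       # step 2: conditional request, expect 304 or 200
    return "fetch"                # step 3: stale and no validator, full request
```

For example, an entry that is 10 seconds old with a 60-second lifetime is served from the strong cache, while the same entry at 120 seconds is revalidated if it carries an ETag.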

I found that opening the page in a new tab always showed status code 200 (from disk cache). Refreshing on that page then returned 304.

Strong cache

With "from disk cache", the browser kept using its own cache and never pulled my latest files at all. In this case only a forced refresh fetched them (200). This state is the result of strong caching, which is controlled by Expires/Cache-Control:

1. Expires (from HTTP/1.0) is an absolute time. If the browser's current time has not passed the Expires time, the cache is still valid: the strong cache is hit and the resource is read directly from the cache. Because browser and server clocks can differ significantly, Cache-Control was later introduced in HTTP/1.1.

2. The Cache-Control value (max-age) is a relative time. When the browser requests the resource for the first time, it caches the response headers. Subsequent requests first read the response headers from the cache and compute the cache validity period from the time of the first request and the Cache-Control value. If that period has not elapsed, the cache is still valid: the strong cache is hit and the resource is read directly from the cache.

Cache-Control uses an interval, which avoids the clock synchronization problem. However, Cache-Control exists only in HTTP/1.1 and does not work with HTTP/1.0, while Expires works with both, so sending both headers is the better option in most cases. When a client can parse both headers, Cache-Control takes precedence.
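As a rough sketch of that precedence, the function below derives a freshness lifetime from response headers, using max-age when it is present and falling back to Expires minus Date otherwise. The function names are hypothetical and the parsing is deliberately simplified (it ignores directives such as no-cache and s-maxage).

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_max_age(cache_control: str) -> Optional[int]:
    """Extract max-age (in seconds) from a Cache-Control value, if present."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip()
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def freshness_lifetime(headers: dict) -> Optional[float]:
    """Seconds the response may be served from the strong cache, or None."""
    # Cache-Control takes precedence when both headers are present.
    max_age = parse_max_age(headers.get("Cache-Control", ""))
    if max_age is not None:
        return float(max_age)
    if "Expires" in headers and "Date" in headers:
        try:
            expires = parsedate_to_datetime(headers["Expires"])
            date = parsedate_to_datetime(headers["Date"])
            return (expires - date).total_seconds()
        except (TypeError, ValueError):
            return 0.0  # an unparseable date (such as "0") counts as expired
    return None  # no explicit lifetime was given
```

With `Cache-Control: public, max-age=600` and an Expires one minute after Date, the result is 600 seconds. Without the Cache-Control header it is 60 seconds.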

Note: Chrome changed how it reports cache hits in later versions. "from cache" is now shown as "from disk cache" or "from memory cache".

Negotiated cache (weak cache)

The 304 from the later refresh on the page is in fact the negotiated cache at work.

If Last-Modified/ETag is configured, the browser still sends a request to the server asking whether the file has been modified. If not, the server replies with 304, telling the browser to use the data in its local cache. If so, the server sends the entire data back to the browser.
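The server side of this exchange can be sketched as follows for the Last-Modified case. `respond` is a hypothetical helper written for this illustration, not the API of any real server, and it assumes `modified_at` is a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(request_headers: dict, modified_at: datetime, body: bytes):
    """Return (status, headers, body) for a GET, honoring If-Modified-Since.

    modified_at must be an aware datetime in UTC (assumption of this sketch).
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    modified_at = modified_at.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(modified_at, usegmt=True)}
    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            if modified_at <= parsedate_to_datetime(since):
                return 304, headers, b""  # not modified: client reuses its copy
        except (TypeError, ValueError):
            pass  # unparseable date: fall through to a full response
    return 200, headers, body  # modified, or first request: send everything
```

The first request receives 200 together with a Last-Modified header. Sending that value back as If-Modified-Since yields 304 with an empty body until the file changes.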

You might think Last-Modified is enough to tell the browser whether its local cached copy is new enough, so why ETag? ETag was introduced in HTTP/1.1 to solve several problems with Last-Modified:

  1. Last-Modified is only accurate to the second, so it cannot accurately reflect the freshness of a file that is modified more than once within one second
  2. If a file is regenerated periodically, its contents may be unchanged while Last-Modified changes, which makes the cached copy unusable
  3. The server may not obtain the correct file modification time, or its clock may be inconsistent with that of a proxy server

ETag is a unique identifier for the resource on the server side, generated automatically by the server or by the developer, and it allows more accurate cache control. Last-Modified and ETag can be used together. The server verifies the ETag first. If the ETag matches, some servers then go on to compare Last-Modified before finally deciding whether to return 304 (the HTTP specification itself gives If-None-Match precedence over If-Modified-Since). For server-side ETag generation rules and strong versus weak ETags, see Interactive Encyclopedia - ETag and HTTP Header Definition.
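One common way to generate an ETag is to hash the file contents, so that a regenerated file with identical contents keeps the same tag. The sketch below assumes that approach. Real servers use various schemes (some combine size and modification time), and `make_etag` and `etag_matches` are hypothetical names.

```python
import hashlib


def make_etag(content: bytes) -> str:
    """Build a quoted ETag from a hash of the content (one possible scheme)."""
    return '"' + hashlib.sha256(content).hexdigest()[:16] + '"'


def _strip_weak(tag: str) -> str:
    """Drop the W/ prefix, since If-None-Match uses weak comparison."""
    return tag[2:] if tag.startswith("W/") else tag


def etag_matches(if_none_match: str, etag: str) -> bool:
    """True when the client's If-None-Match lists the current ETag (or '*')."""
    # Simplification: assumes the tags themselves contain no commas.
    candidates = [c.strip() for c in if_none_match.split(",")]
    if "*" in candidates:
        return True
    return _strip_weak(etag) in [_strip_weak(c) for c in candidates]
```

When `etag_matches` returns True the server can answer 304. Unlike Last-Modified, the tag does not change when only the file timestamp changes.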

In general, Cache-Control/Expires is used together with Last-Modified/ETag. Even if the server sets a cache time, when the user clicks the "refresh" button the browser ignores the strong cache and still sends a request to the server. Last-Modified/ETag can then take advantage of 304 to reduce the response overhead.

Summary of strong and negotiated caching

  1. Strong cache phase: the browser first looks for the resource locally. If the resource is found and is still valid (for example, within its cache validity time), the strong cache is hit and 200 is shown. The cached copy is used directly. (No request is sent to the server)
  2. Weak cache phase: the browser finds the resource in the local cache and sends an HTTP request to the server. The server determines that the resource has not changed and returns 304, so the browser uses its cached copy. (A request must be sent to verify that the local cache can be used)
  3. Cache failure phase (re-request): when the server finds that the resource has been modified, or the resource is not in the local cache, the server returns the resource data.

Opening a page in a new tab or window checks the strong cache first.

An F5 refresh skips the strong cache (the negotiated cache still applies).

Ctrl+F5 bypasses all caches.

Heuristic cache

So the problem must be caused by the cache. To find out why, I inspected the requests.

The first visit returned 200 (from disk cache), and the Request Headers panel showed "Provisional headers are shown", so the relevant information could not be viewed.

This hint usually appears in the following situations:

  1. Cross-domain, request blocked by the browser
  2. Request blocked by a browser plug-in
  3. Server error or timeout, no real response returned
  4. Strong cache hit (from disk cache or from memory cache), so the real headers are not displayed

This is case number four. The headers contain no cache-related fields. If the local cached version is still considered valid, the resource is read from the cache, no request is sent, and placeholder request headers are displayed.

I found a very inspiring article:

Thoroughly understand the Http cache mechanism – based on the cache strategy of three factors decomposition method

Since the response headers contain no cache-related fields, the heuristic cache expiration policy applies. In the absence of any cache-related fields, the client calculates the difference between Date and Last-Modified in the response headers and takes 10% of that value as the cache lifetime. During this period the local cached data is used directly without any request to the server (except for a forced refresh). After the cache expires, the browser requests the server again, sending the Last-Modified time for comparison, and decides whether to load the data from the local cache according to the server's response status.

I then compared the actual response headers and found that the problem did lie in Last-Modified. On the page in question, the Last-Modified date never changed and appeared to stay in May.

Later, I found that the output time of the production files was wrong when GitLab CI built the image. After this was corrected, everything returned to normal.
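The 10% rule described above is easy to work through in code. This is a minimal sketch of the calculation only. The fraction is a parameter because 10% is a convention that clients may vary, and `heuristic_lifetime` is a hypothetical name.

```python
from email.utils import parsedate_to_datetime


def heuristic_lifetime(date: str, last_modified: str, fraction: float = 0.1) -> float:
    """Heuristic freshness lifetime in seconds: fraction * (Date - Last-Modified)."""
    age = parsedate_to_datetime(date) - parsedate_to_datetime(last_modified)
    # A Last-Modified in the future gives no heuristic lifetime.
    return max(age.total_seconds(), 0.0) * fraction


# A file last modified 10 days before the response was generated
lifetime = heuristic_lifetime(
    "Sat, 11 May 2024 00:00:00 GMT",
    "Wed, 01 May 2024 00:00:00 GMT",
)
print(lifetime / 3600, "hours")  # about 24 hours of strong caching
```

The older the Last-Modified value is relative to Date, the longer the browser keeps using its copy without asking the server. A timestamp that is stuck months in the past therefore produces a heuristic lifetime of days.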

Cookie and session

Cookies and sessions make up for the stateless nature of the HTTP protocol. On its own, the server cannot know whether two HTTP requests come from the same user. With cookies and sessions, the user only needs to log in once, and the server can tell whether a given request requires logging in again.

  1. Cookies: When the client accesses the server for the first time, the server returns a cookie related to the user information in the Response Headers. The client saves the cookie locally and carries it in the Request Headers of its next request. The server receives the cookie and knows the user's status. The cookie's expiry setting (maxAge) defaults to -1, meaning the cookie becomes invalid when the browser is closed. A value of 0 means the cookie becomes invalid immediately, which is equivalent to deleting it (there is no direct method for deleting a cookie). Both the server and the client can set cookies, but neither can operate on cookies under another domain name.

  2. Session: The first time the client accesses the server, the server generates a sessionId to identify the user and stores the user information (the server has a dedicated place for storing the sessionIds of all users). The sessionId is returned as a cookie value in the Response Headers. The client saves the cookie locally and carries the sessionId in the Request Headers of later requests. By looking up the sessionId, the server knows the user's status and updates the last access time of that sessionId. A sessionId can also be set to expire: for example, if a session has not been renewed for 60 minutes, the server deletes it.

In short, the cookie is stored on the client, the session is stored on the server, and the session depends on the cookie.
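The session mechanism can be sketched with an in-memory store. This is an illustration under simplifying assumptions (a single process, no persistence), and `SessionStore`, `set_cookie_header` and the cookie name `sessionId` are hypothetical. The 60-minute timeout is the example value from the text.

```python
import secrets
import time
from typing import Dict, Optional

SESSION_TTL_SECONDS = 60 * 60  # the 60-minute example from the text


class SessionStore:
    """Server-side storage mapping each sessionId to its user information."""

    def __init__(self) -> None:
        self._sessions: Dict[str, dict] = {}

    def create(self, user: str, now: Optional[float] = None) -> str:
        """Create a session at login and return the new sessionId."""
        now = time.time() if now is None else now
        session_id = secrets.token_hex(16)  # random, hard-to-guess identifier
        self._sessions[session_id] = {"user": user, "last_access": now}
        return session_id

    def lookup(self, session_id: str, now: Optional[float] = None) -> Optional[str]:
        """Return the user for a sessionId, or None if unknown or expired."""
        now = time.time() if now is None else now
        session = self._sessions.get(session_id)
        if session is None:
            return None
        if now - session["last_access"] > SESSION_TTL_SECONDS:
            del self._sessions[session_id]  # not renewed in time: delete it
            return None
        session["last_access"] = now  # renew on every successful lookup
        return session["user"]


def set_cookie_header(session_id: str) -> str:
    """Value of the Set-Cookie response header that carries the sessionId."""
    return f"sessionId={session_id}; HttpOnly; Path=/"
```

The server sends `set_cookie_header(...)` once after login. On each later request it reads the sessionId from the Cookie header and calls `lookup` to recover the user.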


Note: see docs.gitlab.com/ee/ci/yaml/… To force GitLab CI to clone the code on every run, so that the code timestamps are the build start time:

variables:
  GIT_STRATEGY: clone

Other references:

Illustrated HTTP

Web caching mechanism series

HTTP Cache