Preface

This post does not explain why grayscale release matters. For background, read these two articles first:

Microservice Deployment: Blue-green Deployment, Rolling Deployment, Grayscale Publishing, Canary Publishing

Grayscale Publishing: Grayscale is Simple, Publishing is Complex

Put simply, a grayscale scheme routes a fixed proportion of users, or users with specific identities, to the latest version of the product ahead of everyone else, so that problems surface early and their impact stays small. Every company has its own grayscale process. Here we discuss only one small part of the scheme: user assignment.
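As a rough illustration of the two selection rules just described (a proportion of users, or users with a special identity), here is a minimal Python sketch. The function name `is_gray_user`, the `whitelist` parameter, and the choice to hash the user ID into buckets are all assumptions made for this example; the scheme discussed in this post does not prescribe how the proportion is drawn.

```python
import hashlib
from typing import FrozenSet


def is_gray_user(user_id: str, percent: float,
                 whitelist: FrozenSet[str] = frozenset()) -> bool:
    """Decide whether a user should see the new version.

    percent is the grayscale share in the range 0..100 (hypothetical parameter).
    whitelist holds users with a special identity, e.g. internal testers.
    """
    if user_id in whitelist:
        return True
    # Hash the ID so the same user always lands in the same bucket (0..9999).
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000
    return bucket < percent * 100


if __name__ == "__main__":
    testers = frozenset({"qa-001"})
    print(is_gray_user("qa-001", 0, testers))   # True: whitelisted
    print(is_gray_user("user-42", 100))         # True: everyone is in at 100%
```

Hashing keeps the decision stable for a given user without storing any state, which is one possible design; a random draw remembered in a cookie, as in the flow discussed in this post, is another.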

Grayscale process

Coarse-grained gray scale flow chart (there are details)

The coarse-grained process looks fine at first glance, but a closer look shows that it is full of holes:

  • The first request, which carries no cookie, always goes to the online cluster. If that request is selected for grayscale, the asynchronous requests that follow are routed to the beta cluster, so one page mixes resources from two clusters.
  • When the cookie of a user on the beta cluster expires (the browser clears it automatically), the following asynchronous requests go through grayscale assignment again. If the user is not selected this time, those requests are routed to the online cluster, and resources are mixed again.
  • If the cookie's expiry time is set too short, grayscale no longer works as intended.
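The first two bullets above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the cookie name `gray`, the function names, and the simplification that routing looks only at the cookie are hypothetical and stand in for the real proxy and middle-tier logic.

```python
from typing import Dict, Tuple

GRAY_COOKIE = "gray"  # hypothetical cookie name


def route(cookies: Dict[str, str]) -> str:
    """Pick a cluster from the grayscale cookie; no cookie means online."""
    return "beta" if cookies.get(GRAY_COOKIE) == "beta" else "online"


def first_visit(gray_hit: bool) -> Tuple[str, str]:
    """Return (cluster of the sync request, cluster of the async requests)."""
    cookies: Dict[str, str] = {}
    sync_cluster = route(cookies)      # no cookie yet, so always online
    if gray_hit:
        cookies[GRAY_COOKIE] = "beta"  # set while the sync request is handled
    async_cluster = route(cookies)     # later requests carry the new cookie
    return sync_cluster, async_cluster


def cookie_expires_mid_session() -> Tuple[str, str]:
    """Page loaded from beta, cookie expires, user is not selected again."""
    cookies = {GRAY_COOKIE: "beta"}
    sync_cluster = route(cookies)
    cookies.clear()                    # the browser drops the expired cookie
    async_cluster = route(cookies)
    return sync_cluster, async_cluster


if __name__ == "__main__":
    print(first_visit(gray_hit=True))     # ('online', 'beta')
    print(cookie_expires_mid_session())   # ('beta', 'online')
```

In both cases the page and its asynchronous requests end up on different clusters, which is the resource confusion described above.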

So the process has to be optimized.

A few major problems

1. Problems with synchronous and asynchronous resources

Description:

Within a single session, synchronous and asynchronous requests reach different clusters because they are sent at different times. Assume that the resources on the online cluster and the beta cluster are not consistent with each other.

Scenarios:

1. Synchronous online, asynchronous beta: the synchronous request carries no cookie and goes to the online cluster. While it is being handled, the user is selected for grayscale and the cookie is set, so the asynchronous requests that follow go to the beta cluster.

2. Synchronous beta, asynchronous online: the synchronous request carries the cookie and goes to the beta cluster. The cookie then expires, and the subsequent asynchronous requests go to the online cluster.


Option A) After the Node middle tier selects a user for grayscale, proxy the request back to Nginx for routing. (1-, -2) {1: solves scenario 1; 1-: partially solves it; -1: does not solve it; same notation below}

Option B) Keep beta cluster resources compatible with the online cluster. (1, -2)

Option C) Give the beta cluster its own domain name (302 redirect) and use the domain name to distinguish online from beta. (-1, 2)

Combining options B and C solves scenarios 1 and 2.
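To make Option C more concrete, here is a minimal sketch of the redirect decision. The host names, the function names, and the `X-Served-By` header are hypothetical and exist only for illustration; a real setup would do this in the proxy or middle tier.

```python
from typing import Dict, Tuple

ONLINE_HOST = "www.example.com"   # hypothetical online domain
BETA_HOST = "beta.example.com"    # hypothetical beta domain


def cluster_for(host: str) -> str:
    """With an independent beta domain, the host alone decides the cluster."""
    return "beta" if host == BETA_HOST else "online"


def handle_page_request(host: str, path: str,
                        gray_hit: bool) -> Tuple[int, Dict[str, str]]:
    """Return (status code, response headers) for a synchronous page request."""
    if host == ONLINE_HOST and gray_hit:
        # Send the selected user to the beta domain with a 302 redirect.
        return 302, {"Location": "https://" + BETA_HOST + path}
    # Otherwise serve from whichever cluster owns the requested host.
    return 200, {"X-Served-By": cluster_for(host)}


if __name__ == "__main__":
    print(handle_page_request(ONLINE_HOST, "/home", gray_hit=True))
    # Once on the beta domain, requests stay on beta without any cookie.
    print(handle_page_request(BETA_HOST, "/api/list", gray_hit=False))
```

Because relative asynchronous requests inherit the page's domain, a page loaded from the beta domain keeps talking to the beta cluster even after the grayscale cookie is gone, which is why this option addresses scenario 2.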

2. Grayscale cookie expiration or reset problem

Description:

The following scenarios occur when the disconf configuration is updated in the middle of a session, or when the cookie expires naturally. Either way, requests end up fetching the wrong resources.

Scenarios:

3a. Grayscale is turned on before the synchronous request (online -> beta, resources in sync)

3b. Grayscale is turned off before the synchronous request (beta -> online, resources in sync)

4a. Grayscale configuration is reset after the synchronous request (online) and before the asynchronous request (beta)

4b. Grayscale configuration is reset after the synchronous request (beta) and before the asynchronous request (online)

5a. Grayscale configuration is reset before the next synchronous request (online -> beta, synchronous resources out of sync)

5b. Grayscale configuration is reset before the next synchronous request (beta -> online, synchronous resources out of sync)


Option A) Same as above. (3a, 3b, -4a, -4b, -5a, -5b)

Option B) Same as above. (3a, -3b, 4a, -4b, 5a, -5b)

Option C) Same as above. (-3a, 3b, 4a, 4b, -5a, 5b)

Combining options B and C solves scenarios 3, 4, and 5.

3. Validity period of the grayscale cookie

Description:

Assuming the problems above have been solved, what is a reasonable maxAge for the cookie?

  • Short validity period, for example 10 seconds

Problem: If a user stays on a page for more than 10 seconds, the user's asynchronous requests switch back and forth between the online and beta clusters. Even with the resource confusion solved, the load on the beta cluster rises sharply.

The share of users assigned to beta also drifts. Every expiry triggers a fresh assignment, so within one day nearly all users will have been diverted to the beta cluster at some point. Grayscale then becomes meaningless and carries great risk.

  • Long validity period, for example 1 day or more

Problem: A long expiry time avoids the fatal flaw of a short one: the beta cluster receives a smaller share of users and less server load.

On the other hand, if the beta cluster fails and goes down, or we take it offline, grayscale users get 404 responses for up to a day. There is no fix; we can only wait for the cookie to expire or for users to switch browsers on their own. The result is a flood of customer service calls and complaints of "junk website!", which is totally unacceptable.

  • Moderate validity period, for example 10 minutes to 1 hour

Generally speaking, unless the site is a productivity tool, a single visit does not last more than 1 hour. Even if users never close the page, coming back to it after 1 hour has little impact on the site.

The 404 responses caused by an outage still cannot be avoided, but the loss is kept to a minimum.
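The trade-off between a short and a moderate validity period can be checked with a little arithmetic. The sketch below assumes that every cookie expiry triggers an independent re-assignment with the same probability and that the user stays active the whole time; the function name and parameters are hypothetical, and the model is a simplification rather than a measurement.

```python
import math


def share_ever_in_beta(gray_ratio: float, max_age_s: float,
                       active_s: float) -> float:
    """Probability that a user hits beta at least once while active.

    gray_ratio: chance of being selected on each assignment (0..1).
    max_age_s:  cookie lifetime in seconds.
    active_s:   how long the user keeps sending requests, in seconds.
    """
    # One assignment per cookie lifetime, and at least one overall.
    assignments = max(1, math.ceil(active_s / max_age_s))
    return 1 - (1 - gray_ratio) ** assignments


if __name__ == "__main__":
    one_hour = 3600
    # 5% grayscale, cookie re-rolled every 10 seconds for an hour.
    print(share_ever_in_beta(0.05, 10, one_hour))
    # Same ratio, but the cookie lives for the whole hour.
    print(share_ever_in_beta(0.05, one_hour, one_hour))
```

With a 10 second cookie, an active user is re-assigned hundreds of times per hour, so the chance of landing on beta at least once approaches 1. With a cookie that outlives the visit, the share stays at the configured ratio.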

Conclusion

Generally speaking, options B and C together solve most of the problems above.

Option B: beta cluster resources are compatible with the online cluster. Static resources are published to the CDN and stay there long-term, so only asynchronous resources need to be kept in sync.

Option C: the beta cluster gets its own domain name (302). The domain name distinguishes online from beta and isolates the two, so a user's current session stays on the beta cluster even if the cookie expires.

Option A can also be useful in certain business scenarios, for example to avoid cross-domain requests.

Problems are relative and solutions are flexible. Different kinds of systems run into different problems, and all we can do is work out a solution for the problem at hand.

If you have a better solution, please feel free to comment. Thank you!


Please indicate the source for reprinting

By Zwwill Kiba

Initial address: github.com/zwwill/blog…