Author: Heavy Silver | Source: Alibaba Technology public account

Preface

DevOps pursues shorter iterations and more frequent releases. But the more often you release, the more likely you are to introduce failures. More failures reduce service availability, which in turn hurts the customer experience. To ensure quality of service and guard the release as the last checkpoint before production, Alibaba has gradually developed release strategies that meet the requirements of DevOps.

Before we turn to Alibaba's practices, let's take a quick look at some common release strategies, the scenarios where they work, and their pros and cons.

I Common release strategies

1 Downtime release

A downtime release shuts down the service before the release, blocks user access, and then upgrades all services at once. Releases of this kind tend to be infrequent and require adequate testing beforehand.

Characteristics of a downtime release:

  • All the components that need to be upgraded are consolidated into one release
  • Most of the applications in a project will be updated
  • The development and testing process before release often takes a long time
  • If something goes wrong at release time, it can be costly to fix and roll back
  • A downtime release takes a long time to complete and requires many teams to work together
  • The client side and server side are often required to upgrade synchronously

Downtime releases are a poor fit for Internet companies: the interval between launches is too long, features take too long to get from development to market, the company reacts too slowly to market feedback, and all of this is a serious disadvantage in a fully competitive market. Each release also brings financial losses due to the downtime.

Advantages:

  • Simple; there is little need to consider compatibility between coexisting old and new versions

Disadvantages:

  • The service is unavailable during the release
  • It can only be released during periods of low business traffic (often at night) and requires many teams to work together
  • It is difficult to roll back after a failure

Applicable scenarios:

  • Development and test environments
  • Non-critical applications with little user impact
  • Scenarios where compatibility is difficult to control

2 Canary release

The term "canary release" goes back to the early 20th century, when British coal miners carried caged canaries with them down the mine. Canaries react to toxic gases such as carbon monoxide much faster than humans do, so if the concentration in the mine was high, the canary showed signs of poisoning before the miners were affected, and the miners knew to evacuate immediately. A canary release ships the new version of the software to a subset of users before releasing it to everyone, and verifies it with real customer traffic to make sure the software has no serious problems, which reduces the risk of the release.

In practice, a canary release typically goes to a small percentage of machines, for example 2% of the servers, for traffic verification. The quick feedback from those machines decides whether to expand the release or roll it back. A canary release is usually combined with a monitoring system, whose metrics are used to observe the health of the canary machines. If the canary test passes, all remaining machines are upgraded to the new version; otherwise the code is rolled back.
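The promote-or-roll-back decision described above can be sketched as follows. This is a minimal illustration, not Alibaba's implementation: the `Metrics` shape, the function names, and the 1% error-rate tolerance are assumptions made for the example; only the 2% canary share comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Aggregated request counts for a group of machines (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_size(total_machines: int, percent: int = 2) -> int:
    """Number of canary machines: the given percentage, but at least one."""
    return max(1, total_machines * percent // 100)


def canary_verdict(canary: Metrics, baseline: Metrics,
                   max_extra_error_rate: float = 0.01) -> str:
    """Roll back if the canary's error rate exceeds the baseline's by more
    than the (assumed) tolerance; otherwise promote to the remaining machines."""
    if canary.error_rate - baseline.error_rate > max_extra_error_rate:
        return "rollback"
    return "promote"


print(canary_size(500))                                       # 10
print(canary_verdict(Metrics(1000, 3), Metrics(49000, 98)))   # promote
print(canary_verdict(Metrics(1000, 50), Metrics(49000, 98)))  # rollback
```

A real system would look at several metrics over a time window rather than a single error rate, but the shape of the decision stays the same.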

Advantages:

  • The impact on user experience is minimal; only a small number of users are affected during the canary phase
  • Release safety is better assured

Disadvantages:

  • The number of canary machines is relatively small, so some problems may not be exposed

Applicable scenarios:

  • Systems whose monitoring is relatively complete and integrated with the release system

3 Grayscale/rolling release

A grayscale release is an extension of the canary release: the release is divided into stages/batches, and the number of users covered grows with each stage/batch. If no problems are found with the new version in the current stage, the release moves on to the next, larger stage, until it covers all users.

Grayscale release reduces release risk and is a zero-downtime strategy. It shifts gradually from one version to the other by adjusting the routing weights between the versions that are online. The whole release takes a long time, during which old and new code coexist. Compatibility between versions therefore has to be considered during development, so that the coexistence of old and new code does not affect functional availability or the user experience. If the new version has a problem, a grayscale release can be rolled back quickly to the old version.

Combined with feature switches and similar techniques, grayscale release can support more complex and flexible release strategies.
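The staged expansion and rollback described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the cumulative percentages (5, 20, 50, 100) and the `deploy` and `healthy` callbacks are hypothetical stand-ins for a real release system and its monitoring.

```python
from typing import Callable


def plan_batches(machines: list[str],
                 cumulative_percent: tuple[int, ...] = (5, 20, 50, 100)) -> list[list[str]]:
    """Split machines into batches so that coverage grows stage by stage."""
    batches, done = [], 0
    for percent in cumulative_percent:
        # Round up so that even a small fleet gets one machine in the first stage.
        target = (len(machines) * percent + 99) // 100
        if target > done:
            batches.append(machines[done:target])
            done = target
    return batches


def rolling_release(machines: list[str],
                    deploy: Callable[[str, str], None],
                    healthy: Callable[[str], bool]) -> bool:
    """Release batch by batch; roll everything back if any released machine is unhealthy."""
    released: list[str] = []
    for batch in plan_batches(machines):
        for machine in batch:
            deploy(machine, "new")
        released.extend(batch)
        if not all(healthy(machine) for machine in released):
            for machine in released:
                deploy(machine, "old")
            return False
    return True


machines = [f"host-{i}" for i in range(20)]
print([len(batch) for batch in plan_batches(machines)])  # [1, 3, 6, 10]
```

Because old and new versions run side by side between batches, the health check is applied to every machine released so far, not only to the latest batch.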

Advantages:

  • The impact on user experience is relatively small, and the service does not need to be stopped for the release
  • Ability to control release risk

Disadvantages:

  • The release takes a long time
  • Complex release systems and load balancers are required
  • Compatibility between old and new versions needs to be considered

Applicable scenarios:

  • Production environments with high availability requirements

4 Blue-green release

A blue-green deployment has two identical, independent production environments, one called the “blue environment” and the other called the “green environment”. Suppose the green environment is the production environment that users are currently using. To deploy a new version, you first deploy it to the blue environment and then run smoke tests there to check that the new version works. If the tests pass, the release system updates the routing configuration to shift user traffic from the green environment to the blue environment, which then becomes the production environment. This switch usually takes less than a second. If something goes wrong, you cut the route back to the green environment and debug in the blue environment to find the cause of the problem. As a result, a blue-green deployment makes the new version and its new functionality visible to all users at once with a single switch.
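The deploy, smoke-test, switch, and roll-back sequence can be sketched as follows. The `BlueGreenRouter` class and the `smoke_test` callback are hypothetical names for this example; in a real system the switch is a routing or load balancer configuration update rather than an attribute assignment.

```python
from typing import Callable, Optional


class BlueGreenRouter:
    """Tracks which of two identical environments currently serves user traffic."""

    def __init__(self, live_version: str) -> None:
        self.versions: dict[str, Optional[str]] = {"blue": None, "green": live_version}
        self.live = "green"

    @property
    def idle(self) -> str:
        return "blue" if self.live == "green" else "green"

    def release(self, version: str, smoke_test: Callable[[str, str], bool]) -> bool:
        """Deploy to the idle environment and switch traffic only if smoke tests pass."""
        target = self.idle
        self.versions[target] = version
        if not smoke_test(target, version):
            return False          # users never saw the new version
        self.live = target        # the single switch that moves all traffic
        return True

    def rollback(self) -> None:
        """Cut traffic back to the previous environment, which still runs the old version."""
        self.live = self.idle


router = BlueGreenRouter("v1")
router.release("v2", smoke_test=lambda env, version: True)
print(router.live, router.versions[router.live])  # blue v2
router.rollback()
print(router.live, router.versions[router.live])  # green v1
```

The sketch also shows why the strategy needs twice the machines: the old environment must stay intact for `rollback` to be instant.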

Advantages:

  • Upgrade switching and rollback are very fast
  • Zero downtime

Disadvantages:

  • The switch is all-or-nothing, so a problem in the release has a relatively large impact on users
  • It requires twice as many machine resources
  • The middleware and the application itself must support switching traffic to the hot standby cluster

Applicable scenarios:

  • Machine resources are in surplus or can be allocated on demand (relying on cloud providers)

5 A/B testing

A/B testing is very similar to grayscale release; the two differ in the purpose of the release. A/B testing focuses on making a decision based on the differences between versions A and B, and ultimately choosing one version to deploy. Compared with grayscale release, A/B testing leans more toward decision-making, and it is more flexible than canary release in adjusting weights and traffic.

For example, suppose a feature has two implementations, A and B. With fine-grained traffic control, 50% of users are always directed to implementation A and the remaining 50% are always directed to implementation B. By comparing the conversion rates of the two, the implementation with the higher conversion rate is selected as the final version of the feature.
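The example above depends on each user always landing on the same implementation. One common way to get that, shown here as an assumption rather than as the method used at Alibaba, is to hash the user ID into a bucket. The function names and the experiment key are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, percent_a: int = 50) -> str:
    """Map a user to 'A' or 'B' deterministically, so repeat visits see the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket in the range 0..99
    return "A" if bucket < percent_a else "B"


def pick_winner(stats: dict[str, tuple[int, int]]) -> str:
    """stats maps variant -> (conversions, visitors); return the higher conversion rate."""
    return max(stats, key=lambda variant: stats[variant][0] / stats[variant][1])


# The same user is always routed to the same implementation.
print(assign_variant("user-42", "checkout-button") ==
      assign_variant("user-42", "checkout-button"))          # True
print(pick_winner({"A": (120, 1000), "B": (150, 1000)}))     # B
```

Including the experiment name in the hash keeps bucket assignments independent across experiments. A production decision would also check that the difference in conversion rate is statistically significant before picking a winner.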

Advantages:

  • Quick experiment capability
  • Little impact on the user experience
  • You can use production traffic for testing
  • You can test for specific users

Disadvantages:

  • Requires more complex business traffic identification and control capabilities
  • More complex compatibility issues between the old and new versions need to be considered

Applicable scenarios:

  • Business exploration and innovation testing
  • Situations where a decision must be made among multiple options

6 Traffic isolation environment release

The release strategies above all take a single application as the unit of release, but a functional module usually consists of several applications that provide a service together. When the application being released has a defect, the defect does not necessarily show up in that application; in complex cases it surfaces only later, in downstream applications. How to detect such problems without affecting the user experience is therefore important. In addition, we sometimes want a new version of the code to go live and affect only a small number of users. Traditional grayscale release, however, cannot identify business traffic, so even if only one machine of an application has a problem, it may affect all users.

In the grayscale release on the left of the figure below, every machine in App1 has some probability of routing to the faulty App2 machine, shown in red. In the isolated-environment release on the right, the new version of the code is first released to a full-link isolated environment, so even if the release has a problem, it affects only a small number of users.
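The difference between the two sides of the figure comes down to whether routing can see a traffic tag. The sketch below is a simplified model, not Alibaba's middleware: the `env` tag, the `Instance` type, and the fallback to online instances for applications that have no isolated instance are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    env: str  # "online" or "isolated"


def pick_instance(instances: list[Instance], tags: dict[str, str],
                  rng: random.Random) -> Instance:
    """Send tagged traffic to the isolated environment and everything else to online."""
    wanted = "isolated" if tags.get("env") == "isolated" else "online"
    pool = [i for i in instances if i.env == wanted]
    if not pool:
        # Assumed fallback: an application without isolated instances serves from online.
        pool = [i for i in instances if i.env == "online"]
    return rng.choice(pool)


def route_chain(topology: dict[str, list[Instance]], tags: dict[str, str],
                rng: random.Random) -> list[Instance]:
    """Route one request through each application in order, carrying the same tags."""
    return [pick_instance(instances, tags, rng) for instances in topology.values()]


topology = {
    "App1": [Instance("app1-a", "online"), Instance("app1-b", "online"),
             Instance("app1-iso", "isolated")],
    "App2": [Instance("app2-a", "online"), Instance("app2-b", "online"),
             Instance("app2-iso", "isolated")],   # runs the new version
}
rng = random.Random(0)
print([i.name for i in route_chain(topology, {"env": "isolated"}, rng)])
# ['app1-iso', 'app2-iso']
```

Untagged requests never reach `app2-iso`, so a fault in the new version is confined to the small group of users whose traffic carries the tag. This only works if every hop on the link forwards the tags, which is the middleware requirement the section goes on to list as a drawback.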

Advantages:

  • Able to find complex problems that involve multiple applications
  • When a failure occurs, it affects only a small number of users

Disadvantages:

  • The traffic isolation environment needs to be monitored independently
  • The system design is complex and requires the middleware and all applications on the link to be able to recognize the traffic

Applicable scenarios:

  • Core production business scenarios

II Alibaba's release best practices

We introduce Alibaba's release best practices in the order of the release process.

1 Release Plan

Before a release, we should fully verify the functional modules being released and think through how to stop the bleeding if the release introduces a fault. It is therefore important to write a release plan ahead of time. A typical release plan looks like this:

Participants in this release

  • Developers
  • Testers
  • Code reviewers

Release content

Testing process

Risk description

Verification plan

Online release steps

  • Release in x batches
  • Pause for x hours after the first x batches are released

Stop-bleeding plan if a problem occurs

2 Use different release strategies for different environments

Each of the release strategies introduced above has its own advantages and disadvantages. Choose the one that fits the characteristics and requirements of your own scenario.

Generally speaking, the test environment is used for preliminary functional testing, so code is updated and released there frequently. If grayscale release is used with a relatively large number of batches, development efficiency drops sharply. In this case, a single-batch downtime release on one or several machines is actually a good choice.

For the pre-release environment, you have to consider not only your own testing needs but also those of the developers of upstream and downstream applications, so a single-batch downtime release is no longer appropriate; you can release in two batches instead.

For the online environment, you can release to the traffic-isolated environment first, and then release to the online environment in a larger number of batches.
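The per-environment choices above can be written down as a small policy table. This is an illustrative sketch: the environment keys and the `release_steps` helper are hypothetical, and the batch count of 10 for the online environment is an assumed value, since the text only says "more batches".

```python
# Release policy per environment, following the recommendations above.
RELEASE_POLICY = {
    "test":        {"isolated_first": False, "batches": 1},
    "pre-release": {"isolated_first": False, "batches": 2},
    "online":      {"isolated_first": True,  "batches": 10},  # 10 is an assumed value
}


def release_steps(environment: str) -> list[str]:
    """Expand an environment's policy into an ordered list of release steps."""
    policy = RELEASE_POLICY[environment]
    steps = []
    if policy["isolated_first"]:
        steps.append("release to traffic-isolated environment")
    total = policy["batches"]
    steps += [f"release batch {n} of {total}" for n in range(1, total + 1)]
    return steps


print(release_steps("pre-release"))
# ['release batch 1 of 2', 'release batch 2 of 2']
```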

3 Pay attention to monitoring and alerts during the release

Release strategies alone cannot prevent failures; it is important to watch your application's monitoring data carefully during and after the release. Core application metrics such as QPS, RT, success rate, and error count help you detect failures as early as possible. In addition, in a production environment where only a few machines are released in each batch, it is important to configure independent monitoring for the released machines: even if some of their metrics turn abnormal, the data volume is small and may be submerged in the overall monitoring data.
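A small worked example shows how an anomaly on a few released machines can be submerged in the overall data. The numbers and the 99.5% alert threshold are invented for illustration: 1,000 of 50,000 requests hit the released machines, where 10% of requests fail, while the rest of the fleet fails 0.1% of the time.

```python
def success_rate(requests: int, failures: int) -> float:
    return 1 - failures / requests


def breaches(requests: int, failures: int, min_success_rate: float = 0.995) -> bool:
    """True if the success rate falls below the (assumed) alert threshold."""
    return success_rate(requests, failures) < min_success_rate


released = {"requests": 1_000, "failures": 100}    # newly released machines
others = {"requests": 49_000, "failures": 49}      # machines still on the old version

overall_requests = released["requests"] + others["requests"]
overall_failures = released["failures"] + others["failures"]

print(breaches(overall_requests, overall_failures))  # False: about 99.7% overall
print(breaches(**released))                          # True: 90% on released machines
```

The overall success rate stays above the threshold, so a fleet-wide alert would stay silent, while a monitor scoped to the released machines fires immediately.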

4 Canary release and unattended monitoring

Most applications inside Alibaba are deployed across multiple data centers/units. The same code and configuration may work normally in some data centers/units and fail in others, so when releasing in batches, the first batch should cover every data center/unit so that problems are exposed as early as possible. In addition, developers tend to focus on the first few batches of a release; if problems occur in later batches, they may not be able to respond quickly.

Unitization is meant to solve the problems of disaster recovery and scalability. The picture above shows Alibaba's unitized deployment architecture. In addition, an application usually has many monitoring items, and with a long release cycle it is unrealistic to ask R&D personnel to watch every item all the time; some intelligent solution is needed to help them find the items that deserve attention.

To solve these two problems, Alibaba designed and implemented its own canary release strategy. The canary release first releases to 10% of the machines in each data center/unit of the application. The unattended intelligent monitoring system sets up independent monitoring for this group of machines. For each monitoring item, it compares the metrics of released and unreleased machines, and also compares the data from before and after the release. If it finds an anomaly, it pushes the anomaly to R&D personnel for further judgment.
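The two comparisons described above (released versus unreleased machines, and before versus after the release) can be sketched as follows. This is not the actual unattended monitoring algorithm, which the article does not describe: the 20% threshold and the rule that both comparisons must deviate before an item is reported are assumptions chosen to keep the example simple.

```python
def relative_change(new: float, old: float) -> float:
    """Change of new relative to old; infinite if old is zero and new is not."""
    if old == 0:
        return float("inf") if new > 0 else 0.0
    return (new - old) / old


def find_anomalies(before: dict[str, float],
                   after_released: dict[str, float],
                   after_unreleased: dict[str, float],
                   threshold: float = 0.2) -> list[str]:
    """Report monitoring items where released machines deviate both from their
    unreleased peers and from their own pre-release values."""
    anomalies = []
    for item, value in after_released.items():
        vs_peers = relative_change(value, after_unreleased[item])
        vs_past = relative_change(value, before[item])
        if abs(vs_peers) > threshold and abs(vs_past) > threshold:
            anomalies.append(item)
    return anomalies


before = {"rt_ms": 50, "error_count": 2, "qps": 100}
after_released = {"rt_ms": 52, "error_count": 9, "qps": 150}
after_unreleased = {"rt_ms": 51, "error_count": 2, "qps": 148}

print(find_anomalies(before, after_released, after_unreleased))  # ['error_count']
```

In this example QPS rose by 50% after the release, but it rose on unreleased machines too, so it is treated as a traffic change rather than a release problem. Only the error count, which changed on the released machines alone, would be pushed to R&D personnel.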

This kind of canary release strategy helps R&D personnel find problems as early as possible, reduces their workload, and improves R&D efficiency.

5 Continuous integration and release

By choosing a suitable release strategy and following the best practices described above, release risk can be kept very low, even lower than that of a downtime release. In fact, short release cycles with a small amount of code per release are good practice. When deployment intervals are long, each deployment involves more code changes and therefore carries more risk of defects and downtime. The usual reaction is to add more reviews to reduce release risk, which in practice does little for release risk and significantly increases deployment time. This is a reinforcing loop that keeps getting worse, and we need to break the vicious cycle with high-frequency continuous deployment.

III Summary

Agile development shortens time to market, lets consumers get the features they want sooner, and lets product teams get consumer feedback faster and iterate on the product. To address the release risk that frequent releases bring under agile development, this article introduced a variety of release strategies, along with the advantages, disadvantages, and applicable scenarios of each. Applying these strategies in combination, according to the scenario, makes it possible to deliver high-quality products more quickly.

This article is original content from Aliyun and may not be reproduced without permission.