In the last article, we discussed some of the measures large-scale websites adopt to ensure high availability. In day-to-day website operation and maintenance, availability is threatened not only by failures of the network, servers, and other hardware, but also by risks that come from the software system itself.

This is especially true when a new version of the website goes online. The project has to be repackaged and re-released, and during that window the server being updated is, as far as availability is concerned, effectively down. If the website is updated frequently, say once or twice a week, these release windows become a regular source of unavailability that a highly available design has to take into account.

Below are some of the quality assurance practices websites use, which differ from those of the traditional software development process.

1. Website release

Publishing a website has the same effect as a server going down, and its impact on system availability is similar. So when designing a highly available architecture for a website, assume that a server goes down not once or twice a year because of physical failures, but once or twice a week because of releases. You may think your application is not that important, that it restarts quickly, and that users can tolerate one or two outages a year, so a complex high-availability design is unnecessary. In reality, because the application is released constantly, users face an outage once or twice every week.

However, a release is a server shutdown that is known in advance, so the process can be made gentler and its impact on users smaller. Releases are usually carried out with publishing scripts, as shown in the figure below.
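As a rough illustration of what such a publishing script might do, the sketch below updates servers one at a time: it takes a server out of the load balancer, deploys the new package, checks that the server is healthy, and only then puts it back into rotation. The function name `rolling_release` and the four callbacks are hypothetical placeholders for whatever a real site uses to talk to its load balancer and servers; this is a minimal sketch, not the script from any particular website.

```python
from typing import Callable, List


def rolling_release(
    servers: List[str],
    remove_from_lb: Callable[[str], None],
    deploy: Callable[[str], None],
    health_check: Callable[[str], bool],
    add_to_lb: Callable[[str], None],
) -> List[str]:
    """Update servers one at a time and return the servers that were updated.

    All four callbacks are assumptions standing in for site-specific tooling.
    """
    updated: List[str] = []
    for server in servers:
        remove_from_lb(server)  # stop routing new requests to this server
        deploy(server)          # stop the app, copy the new package, restart
        if not health_check(server):
            # Leave the failed server out of rotation and stop the release,
            # so the remaining servers keep serving the old version.
            raise RuntimeError(f"{server} failed its health check after deploy")
        add_to_lb(server)       # resume routing traffic to this server
        updated.append(server)
    return updated


if __name__ == "__main__":
    done = rolling_release(
        ["web-1", "web-2"],
        remove_from_lb=lambda s: print(f"remove {s} from load balancer"),
        deploy=lambda s: print(f"deploy new package to {s}"),
        health_check=lambda s: True,
        add_to_lb=lambda s: print(f"add {s} back to load balancer"),
    )
    print("updated:", done)
```

Because only one server is out of rotation at any moment, the rest of the cluster keeps serving users throughout the release.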

2. Automated testing

Code must be rigorously tested before it is released to the online servers. Even when a release adds only a small feature to the existing system, the whole website still needs a comprehensive regression test to make sure no unexpected bugs have been introduced. Compatibility with the various browsers also has to be tested. For a web application that is released frequently, the cost, time, and coverage of manual testing are unacceptable.

At present, most websites use Web automated testing, completing the tests with automated testing tools or scripts. Large websites usually develop their own automated testing tools, which can run the whole testing process (system deployment, test data generation, test execution, test report generation, and so on) with one click.
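The "one click" idea can be sketched as a small pipeline runner that executes the stages in order, stops at the first failure, and returns a report of what happened. The name `run_test_pipeline` and the stage callables are hypothetical; a real tool would replace the lambdas with actual deployment, data generation, and test execution steps.

```python
from typing import Callable, Dict, List, Tuple

# Each stage is a (name, callable) pair; the callable returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_test_pipeline(stages: List[Stage]) -> Dict[str, str]:
    """Run the stages in order and return a status report for each one."""
    report: Dict[str, str] = {}
    failed = False
    for name, step in stages:
        if failed:
            # An earlier stage failed, so later stages are not attempted.
            report[name] = "skipped"
            continue
        ok = step()
        report[name] = "passed" if ok else "failed"
        failed = not ok
    return report


if __name__ == "__main__":
    # Placeholder stages; each lambda stands in for a real step.
    report = run_test_pipeline([
        ("deploy system", lambda: True),
        ("generate test data", lambda: True),
        ("execute tests", lambda: True),
    ])
    for stage, status in report.items():
        print(f"{stage}: {status}")
```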

3. Pre-release verification

Even after rigorous testing, software deployed to the online servers often runs into problems, and sometimes the server fails to start at all. The main reason is that the test environment differs from the online environment, especially where the application depends on other services such as databases, caches, and public services, or on third-party services such as a telecom SMS gateway or a bank's online banking interface.

The database table structure may be different, an interface may have changed and broken communication, or a configuration error may cause a connection to fail. The online environment of a dependent service may also simply not be ready. Any of these problems can cause the application to fail. Therefore, when a website is released, the code package that passed testing is not released directly to the online servers. It is first released to a pre-release machine, where development and test engineers perform pre-release verification: they run through some typical business processes and confirm that the system has no problems before the official release.
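Part of this verification can be scripted as a set of smoke checks against the pre-release machine, one per dependency or business process. The sketch below is only an assumption about how such checks might be organized: `verify_pre_release` and the check names are hypothetical, and each check would in practice connect to the real database, cache, or third-party interface.

```python
from typing import Callable, Dict, List


def verify_pre_release(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the checks that failed."""
    failures: List[str] = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises (for example, a refused connection)
            # is treated as a failure rather than aborting the run.
            ok = False
        if not ok:
            failures.append(name)
    return failures


if __name__ == "__main__":
    # Placeholder checks; real ones would query the actual services.
    failed = verify_pre_release({
        "database schema matches": lambda: True,
        "cache reachable": lambda: True,
        "SMS gateway responds": lambda: True,
    })
    if failed:
        print("do not release, failed checks:", failed)
    else:
        print("pre-release verification passed")
```

Running all checks instead of stopping at the first failure gives the engineers the full list of environment differences in one pass.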

4. Code control

Code control is mainly about how development branches are used. Releases are packaged from the trunk, while new work and fixes are developed on a branch. When development is finished, the branch is merged back into the trunk. This way, ongoing development does not affect the availability of the released system.

5. Automated publishing

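New versions of the site are released frequently, and every release carries risk. The whole process requires cooperation among many teams, the code branches merged back into the trunk before a release can conflict with one another, pre-release verification is itself risky, and each release is equivalent to an outage. Automating the release process reduces the number of manual steps in which these errors can creep in.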

6. Grayscale release

Even after an application has been released successfully, faults caused by software problems may still appear. In that case the release has to be rolled back: the newly released software is uninstalled and the previous version's package is released again, so that the system recovers and the fault is removed.

Consider a large application cluster with more than 10,000 servers. Once a fault is discovered, even a rollback can take a long time, and all you can do is watch anxiously as the failure drags on. To cope with this, large websites use grayscale release. The cluster's servers are divided into several parts, and only one part is released each day. If that part runs stably without faults, the next part is released the following day, and the whole cluster is finished over a number of days. If a problem is found along the way, only the servers that have already been released need to be rolled back.
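The batching logic can be sketched as follows. The helper names `split_into_batches` and `grayscale_release`, and the `release`, `is_stable`, and `rollback` callbacks, are hypothetical; in a real deployment `is_stable` would stand for a day of monitoring rather than an immediate function call.

```python
from typing import Callable, List, Sequence


def split_into_batches(servers: Sequence[str], batch_count: int) -> List[List[str]]:
    """Split servers into at most batch_count batches of roughly equal size."""
    if not servers:
        return []
    size = -(-len(servers) // batch_count)  # ceiling division
    return [list(servers[i:i + size]) for i in range(0, len(servers), size)]


def grayscale_release(
    batches: List[List[str]],
    release: Callable[[str], None],
    is_stable: Callable[[List[str]], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Release batch by batch; roll back only the released servers on failure."""
    released: List[str] = []
    for batch in batches:
        for server in batch:
            release(server)
            released.append(server)
        # Observe the servers released so far before touching the next batch.
        if not is_stable(released):
            for server in reversed(released):
                rollback(server)
            return False
    return True


if __name__ == "__main__":
    cluster = [f"app-{n}" for n in range(1, 7)]
    ok = grayscale_release(
        split_into_batches(cluster, 3),
        release=lambda s: print(f"release to {s}"),
        is_stable=lambda released: True,
        rollback=lambda s: print(f"roll back {s}"),
    )
    print("release completed:", ok)
```

Servers in batches that were never reached are left untouched, which is what keeps the rollback small.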

To be continued…