GrowingIO SaaS Product CI/CD Practice (1)

Author: Hao Kuojun

QA Leader at GrowingIO; previously worked at HP and Qihoo 360 in China. Leads the QA team responsible for quality assurance across the GrowingIO product line, and currently focuses on DevOps practices that help the team improve quality and efficiency.

Purpose

This article describes GrowingIO’s past CI/CD practices for its SaaS product line. For historical reasons, some of the tools in the company’s toolchain are relatively niche, and the current CI/CD process still has plenty of room for improvement, but some of the practical experience may serve as a useful reference.

What is CI/CD

CI/CD is a method of delivering applications to customers frequently by introducing automation into the application development stages. The core concepts of CI/CD are Continuous Integration, Continuous Delivery and Continuous Deployment. The industry generally understands them as follows.

CI Continuous Integration

Continuous integration is a development practice in which developers frequently commit code to the trunk, and the newly committed code needs to be validated through compilation and automated test flows before it is finally merged into the trunk.

Continuous integration is the process of automatically detecting, pulling, building, and (in most cases) unit testing and static quality analysis after source code changes.

The goal of continuous integration is to quickly confirm that new changes committed by developers are sound and suitable for further use in the code base. Running the CI process lets us determine whether the new code and the existing code integrate properly.

CD Continuous Delivery

Continuous delivery automatically releases validated code to the repository after completing the automated process of building and unit and integration testing in CI. To achieve an efficient continuous delivery process, it is important to ensure that CI is built into the development pipeline. The goal of continuous delivery is to have a code base ready to deploy into production.

In continuous delivery, each phase, from the consolidation of code changes to the delivery of production-ready builds, involves test automation and code release automation. At the end of the process, the operations team can quickly and easily deploy the application into production or release it to the end user.

CD Continuous Deployment

The final stage for a mature CI/CD Pipeline is continuous deployment, which automatically releases the application to production.

Continuous deployment means that every change that passes the pipeline is automatically deployed to the production environment. For business reasons, a team may choose to stop at continuous delivery and not deploy automatically. Continuous delivery must be in place before continuous deployment can be implemented.

Continuous delivery does not mean that every change to the software has to be deployed into production as soon as possible; it means that any code change can be deployed at any time. Continuous delivery is a capability, while continuous deployment is a way of exercising it. Continuous deployment is the highest stage of continuous delivery.
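The distinction between the capability and the means can be sketched in a few lines. This is only an illustration: the `Build` type, its fields and the function names are hypothetical and do not come from any GrowingIO tool.

```python
from dataclasses import dataclass


@dataclass
class Build:
    version: str
    checks_passed: bool  # outcome of the CI stages (build, unit and integration tests)


def is_releasable(build: Build) -> bool:
    """Continuous delivery: every build that passes CI is ready to deploy."""
    return build.checks_passed


def should_deploy(build: Build, continuous_deployment: bool,
                  manual_approval: bool = False) -> bool:
    """Continuous deployment ships every releasable build automatically.

    With continuous delivery alone, a person decides when to deploy.
    """
    if not is_releasable(build):
        return False
    return continuous_deployment or manual_approval
```

In both modes the same build is releasable; the only difference is whether a human decision sits between "releasable" and "deployed".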

How to implement CI/CD

There are several ways to implement a set of CI/CD processes:

1. Buy off-the-shelf products or services

Many mature products offer full DevOps functionality, such as those from Atlassian, Microsoft’s Azure DevOps, AliCloud and Coding.net. New startups can buy suitable products and services and quickly set up a complete DevOps infrastructure with the tools and best practices they provide. Many DevOps services are integrated with cloud vendors, so they are inexpensive to adopt if you already use the corresponding cloud products.

2. Integrate existing tools

This is the most common practice at present: integrate the existing internal tools for project management, code management, artifact management and so on, possibly with some light custom development, to form a complete CI/CD process. The overall cost is controllable, the existing infrastructure is reused, and the result is flexible and can be tailored to the company’s own process.

Most companies take this route because, in the early days, survival comes before growth: engineering efficiency is not the top priority, and separate tools (such as code management) are adopted to solve problems in specific areas. As the company grows and pays more attention to engineering efficiency, integrating the existing tools is the natural choice, and GrowingIO is no exception.

3. Use an open-source DevOps platform

There are also open-source products that build on and integrate other open-source tools to form a complete DevOps platform, such as Choerodon. If you don’t want to buy a cloud service and don’t want to do custom development, try one of these platforms.

Note, however, that open-source tools do not necessarily require fewer resources. When problems need to be solved, or the tool cannot meet your needs and must be customized, you need to research and understand the tool in depth. When choosing this kind of product, consider whether it matches the company’s current technology stack and whether you have the ability to maintain it.

4. Design and build your own

This means reinventing the wheel: designing, developing and implementing a complete system from scratch. This approach clearly has the highest investment cost, but it is also the most flexible and best meets the needs of the enterprise. Generally, only very large enterprises spend R&D resources this way. For example, some of the cloud service products mentioned in the first category were initially developed for internal use; once mature, they were offered to the outside world as products to generate additional economic returns.

Tools Used

To do a good job, you must first sharpen your tools. Choosing the right tools makes it possible to build an economical and efficient CI/CD process with half the effort. A typical CI/CD process needs tools with at least the following capabilities.

A code repository: version control software that keeps the code maintainable and serves as the source for the build process
A continuous integration server that automates build, test, deploy and other tasks
A centralized artifact repository that stores build artifacts for deployment
An automated deployment tool that deploys the built application to the target servers

Code management Phabricator

GrowingIO chose a self-hosted deployment of Phabricator, a powerful software development collaboration tool similar to GitHub and GitLab. It supports Git, Mercurial and SVN repository hosting, code review, command-line tools, task management, Kanban boards, a wiki, automation rules, webhooks, an API and many other functions.

Jenkins, continuous integration server

Many automation tools can implement continuous integration, and Jenkins is one of the most popular open-source options. Jenkins supports multiple job types and has more than 1,000 plugins backed by an active community and ecosystem. Jenkins 2.0 introduced Pipeline as Code, which lets the definition of a CI/CD pipeline be kept under version control and supports GitOps. In addition, Jenkins supports a master/agent architecture that can be implemented in several ways. When the agents are scheduled through Kubernetes’ orchestration capabilities, agent nodes can be created and destroyed dynamically, which greatly improves CI execution efficiency and resource utilization.

Quality management platform SonarQube

SonarQube is a code quality management platform that detects potential bugs, security vulnerabilities, coding standard violations, duplicated code, missing unit tests and other code quality issues in a project, and provides a UI for review. You can maintain and improve code quality by setting thresholds for the relevant quality metrics. SonarQube’s free Community Edition already meets most project needs, and it is easy to integrate with third-party CI/CD and code review tools.

Artifact repository management Nexus

Nexus is an artifact repository manager developed by Sonatype. It supports dozens of artifact formats, such as Maven, npm, PyPI, Docker and Helm. It also provides webhooks and a REST API, so it can be easily integrated with third-party tools. The free, open-source version (known as Nexus Repository OSS) provides functionality that meets most of our needs.

Deployment tool Capistrano

Capistrano is a free, open-source remote server automation tool implemented in Ruby. It is agentless and works over SSH, so multiple servers can be managed from a single client installation. Capistrano provides a DSL and workflows for deployment and rollback that make it easy to automate remote deployment and rollback of services. Its functionality can also be extended through custom plugins or scripts to meet specific requirements.

In addition to the tools above, we also use Kubernetes, Docker and others. There are many alternatives for each capability mentioned here; the right choice depends on the company’s technology stack, the tools already in use, engineering practices, deployment patterns and many other factors.

Source code management policy

As the starting point of CI/CD, source code management will be deeply integrated with CI/CD Pipeline. Different code branching strategies will affect the design and implementation of CI/CD Pipeline. Therefore, branch management strategies must be well designed according to the current collaboration process of the company before starting to design CI/CD.

There are many Git branch management schemes, each with its own advantages, disadvantages and applicable scenarios. The better-known ones are Git-flow, Trunk-Based and GitHub Flow, and many enterprises customize their own branch strategy, such as Alibaba’s AoneFlow.

GrowingIO tends to use a trunk-based branching strategy, ideally “trunk development, trunk release”, which demands a high level of code quality from development.

The company’s current development practice cannot yet meet that bar, so we settle for the next best thing: “branch development, trunk release”. When branches are very short-lived, merge conflicts are rare and the approach is essentially equivalent to trunk development. A branch release strategy is also used temporarily when a bug on the trunk branch blocks a release.

Branch instructions

The Master branch

The Master branch is the trunk where code is integrated, and it is also the release branch. In general, code enters the Master branch only after it has been fully tested, so the Master branch is always ready for release.

Direct git push to this branch is prohibited. Changes must be submitted as a Diff (the unit of Code Review in Phabricator, equivalent to a GitHub PR or GitLab MR) and pass Code Review and test acceptance before being merged.

The Release branch

The Release branch is a temporary branch used when the Master branch does not meet the requirements for going live but a release is urgently needed.

For example, if a change is found to introduce a serious Bug during the launch process, in order not to affect the normal release of other changes, it is common to remove the faulty Commit and create a temporary branch to continue the release.

Feature branch

The Feature branch, also called the integration test branch, is the most active branch during development, and integration runs every time code is committed to it. In a microservice development model, it is difficult for developers to build a complete integrated environment locally, so an integration test environment in the cloud is needed to help developers complete joint testing.

Passing integration testing, QA acceptance and production acceptance before the code enters the trunk significantly reduces the risk of the merge. Of course, make sure the Feature branch is short-lived enough to avoid merge conflicts.

Trunk-based development opposes long-lived Feature branches and encourages techniques such as feature toggles and Branch by Abstraction, so that code can be merged into the trunk as early as possible without affecting its functionality.

The Local branch

The developer’s local branch. It stays local because Phabricator lets you submit a Diff for Code Review and deployment without creating a remote branch.

After local debugging and unit testing, the developer submits a Diff for Code Review against the trunk (Master branch) or a Feature branch. The code enters the corresponding base branch after passing the review.

Feature Development Process

Implementing a relatively large Feature usually requires collaborative development across multiple data-side components, multiple server-side microservices and the front end. During Sprint planning, the EM (Engineering Manager) splits the Feature Ticket into several sub-tasks, which are assigned to different developers.

Developers of each module create a Feature branch from Master and push it to the remote, as shown by the Feature 1 branch in the figure below
Developers of the corresponding modules create local development branches from the Feature branch to implement sub-tasks. When multiple developers work in the same code repository at the same time, each creates their own local branch, as shown by the Dev1 and Dev2 branches in the figure
After developing and debugging locally and passing the unit tests, the developer submits a Diff for Code Review (CR). If the CR fails, the developer modifies the code and updates the Diff until it passes, then merges it into the Feature 1 branch and deletes the local development branch, as shown by the Dev1 branch. When another developer finishes local development and needs to submit a Diff, they should first rebase onto the Feature branch, then submit the Diff, pass CR, merge into the Feature 1 branch and delete the local branch, as shown by the Dev2 branch. To implement new functionality or fix bugs on the Feature 1 branch later, create a new local branch
The code of the Feature branch is deployed to the corresponding integration test environment so that developers on the data side, server side and front end can test together
After joint testing of the Feature branch passes, another Diff is submitted against the Master branch and handed to QA for testing. After the QA test passes, it is merged into the Master branch and the corresponding Feature branch is deleted
If only one developer works on a Feature branch in a code repository, they can push to it directly, as with the Feature 2 branch, then submit a Diff for Code Review after the code passes integration testing and merge it into the trunk once CR passes. However, even with a single developer, it is recommended to submit code to the Feature branch through Diffs, so that the final Diff does not become too large to review effectively

Tips: Each Diff commit should be preceded by a Rebase or Merge operation with the remote base branch to prevent code Merge conflicts.

The above process may seem complicated, but with the help of Phabricator’s command-line tool, Arcanist, it can be easily implemented.
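As a rough sketch of what the rebase-then-submit routine looks like on the command line, the helper below only builds the command lists and does not run them. The branch name `feature-1` and the remote name `origin` are placeholders, and the exact Arcanist options a team uses may differ from the ones shown here.

```python
from typing import List


def diff_commands(base_branch: str, remote: str = "origin") -> List[List[str]]:
    """Commands a developer runs to submit a Diff against base_branch."""
    target = f"{remote}/{base_branch}"
    return [
        ["git", "fetch", remote],    # get the latest state of the base branch
        ["git", "rebase", target],   # replay local commits on top of it
        ["arc", "diff", target],     # create or update the Diff for review
    ]


def land_commands(base_branch: str) -> List[List[str]]:
    """Command that merges an accepted Diff into base_branch and closes it."""
    return [["arc", "land", "--onto", base_branch]]


if __name__ == "__main__":
    for cmd in diff_commands("feature-1") + land_commands("feature-1"):
        print(" ".join(cmd))
```

Keeping the base branch explicit in both steps makes the same routine work for Diffs against a Feature branch and Diffs against Master.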

The Feature branch can drift from the trunk. For example, after the Feature 1 branch is created, new changes on the Master branch do not reach Feature 1. To deal with this, the CI process automatically merges the trunk branch into Feature 1 in the CI workspace (the merge result is not pushed to the remote). If the merge fails, the CI run fails, and the developer manually updates the Feature 1 branch to synchronize it with Master.
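A minimal sketch of such a trial merge step is shown below. It assumes the CI workspace already has the Feature branch checked out and that the trunk is reachable as `origin/master`; the function names and the injectable `run` parameter are illustrative choices, not the actual Jenkins job.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_git(args: Sequence[str]) -> int:
    """Run a git command in the current directory and return its exit code."""
    return subprocess.run(["git", *args]).returncode


def trial_merge(trunk: str = "origin/master", run: Runner = run_git) -> bool:
    """Merge the trunk into the checked-out Feature branch, in the workspace only.

    Returns True when the merge is clean. Nothing is pushed to the remote.
    """
    if run(["fetch", "origin"]) != 0:
        return False
    if run(["merge", "--no-edit", trunk]) != 0:
        run(["merge", "--abort"])  # leave the workspace clean for the next build
        return False
    return True
```

A CI job would fail the build when `trial_merge()` returns False, which is the signal for the developer to update the Feature branch by hand.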

Hotfix development process

The Hotfix development process is similar to the Feature process. The developer creates a local development branch directly from the Master branch. After local development passes, a Diff is submitted for static code checks, unit tests, Code Review and QA testing.

In some cases, a Release branch is needed: the Master branch has already incorporated a new Feature, and an urgent Hotfix must be released. You can create a Release branch from the last Commit that went live and then cherry-pick the Hotfix from Master onto the Release branch.

Keep the Release branch in case another Hotfix needs to be released, and delete it after the next release from the Master branch. Because our SaaS service is released very frequently, this situation rarely happens.
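The Release branch procedure comes down to two git operations. The sketch below builds the commands without running them; the branch name `release/hotfix` and the commit identifiers in the usage example are placeholders.

```python
from typing import List, Sequence


def hotfix_release_commands(last_released_commit: str,
                            hotfix_commits: Sequence[str],
                            release_branch: str = "release/hotfix") -> List[List[str]]:
    """Commands that build a temporary Release branch carrying only the Hotfix."""
    # Start the branch from the commit that is currently live.
    commands = [["git", "checkout", "-b", release_branch, last_released_commit]]
    # Copy each Hotfix commit from Master; -x records the source commit in the message.
    commands += [["git", "cherry-pick", "-x", commit] for commit in hotfix_commits]
    return commands


if __name__ == "__main__":
    for cmd in hotfix_release_commands("abc123", ["def456"]):
        print(" ".join(cmd))
```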

The CI process

CI process based on Feature branch

At GrowingIO, multiple Feature teams are always developing different features in parallel, and each Feature team has its own independent joint-debugging environment (corresponding to its Feature branch).

When a developer pushes code to the Feature branch, Jenkins automatically detects the change in the corresponding code repository and starts the corresponding task, which runs static checks and unit tests, compiles and packages the code, and deploys it to the joint-debugging environment.

Front-end and server-side applications are packaged as Docker images and deployed to the Kubernetes environment of the corresponding branch, while data-side applications are packaged as ZIP files and deployed to VM-based environments.

After all the code for a Feature has been submitted, joint testing is carried out in the development environment. The flow is: Git Push → Code Check → Unit Test → Build → Deploy → Integrated Testing. After joint testing passes, a Diff is submitted to trigger the following process.
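The stage sequence behaves like a fail-fast chain: each stage runs only if the previous one succeeded. The sketch below models that behaviour in plain Python; it is not Jenkins Pipeline syntax, and the stage callables are stand-ins for the real jobs.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (success, names of the stages that were executed).
    """
    executed = []
    for name, step in stages:
        executed.append(name)
        if not step():
            return False, executed
    return True, executed


if __name__ == "__main__":
    stages = [
        ("Code Check", lambda: True),
        ("Unit Test", lambda: True),
        ("Build", lambda: True),
        ("Deploy", lambda: True),
    ]
    print(run_pipeline(stages))
```

Stopping at the first failure keeps broken code out of the joint-debugging environment, since the deploy stage is never reached.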

CI process based on DIFF

After a developer submits a Diff, automation rules defined in Phabricator call Jenkins through a webhook to trigger the corresponding Jenkins task. The task merges the Diff with the trunk branch (only to check for conflicts; nothing is pushed to the remote branch), then runs static code checks, unit tests and a Sonar scan. Jenkins’ Phabricator plugin then automatically adds unit test coverage and other information to the comments of the Diff.

If any of the automatic checks fails or unit test coverage is below the required level, the developer must make changes until the automatic checks pass (although this restriction may be relaxed for urgent Bugfixes that need to be released quickly).
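The gate described above can be written as a single predicate. In this sketch the 80% threshold is an invented example value, and the choice to relax only the coverage requirement (not the other checks) for urgent Bugfixes is an assumption, since the text does not say which part of the restriction is relaxed.

```python
def diff_gate(checks_passed: bool, coverage: float,
              min_coverage: float = 0.8, urgent_bugfix: bool = False) -> bool:
    """Decide whether a Diff may move on to manual code review.

    checks_passed covers the merge check, static analysis and unit tests.
    coverage and min_coverage are fractions between 0 and 1.
    """
    if not checks_passed:
        return False
    # Assumption: an urgent Bugfix may skip the coverage threshold only.
    return coverage >= min_coverage or urgent_bugfix
```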

Once the automatic checks pass, other developers in the team are asked to perform a manual code review. If the review passes, the Diff is marked as Accepted in Phabricator. At this point, automation rules defined in Phabricator add the QA Team to the Diff as a Blocking Reviewer (until a Blocking Reviewer has accepted, the Diff cannot be merged into the trunk branch) and notify QA staff by email to begin testing.

QA staff deploy the Diff to the corresponding QA environment for testing by manually running a Jenkins task as needed. After the tests pass, QA marks the Diff as Accepted in Phabricator again and notifies the developer to Land the code, although QA can also Land the code directly. (Land is a Phabricator term: it merges the code into the trunk branch and closes the Diff, equivalent to merging a PR/MR in GitHub/GitLab.) The key flow is: Create Diff → Auto Check → Peer Review → QA Review → Land.
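The review flow can be modelled as a small state object whose `can_land` check combines all three gates. This is a toy model for illustration only: the class and method names are invented and it does not use the Phabricator API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Diff:
    auto_check_passed: bool = False
    peer_accepted: bool = False
    # Blocking reviewer name -> whether that reviewer has accepted.
    blocking_reviewers: Dict[str, bool] = field(default_factory=dict)

    def accept_by_peer(self) -> None:
        if not self.auto_check_passed:
            raise ValueError("automatic checks must pass before peer review")
        self.peer_accepted = True
        # Mimics the automation rule that adds QA as a blocking reviewer.
        self.blocking_reviewers.setdefault("QA Team", False)

    def accept_by_blocking(self, reviewer: str) -> None:
        if reviewer not in self.blocking_reviewers:
            raise KeyError(reviewer)
        self.blocking_reviewers[reviewer] = True

    def can_land(self) -> bool:
        return (self.auto_check_passed and self.peer_accepted
                and all(self.blocking_reviewers.values()))
```

Because QA is added as a blocking reviewer only after peer acceptance, QA time is not spent on a Diff that other developers would reject anyway.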

Below is a simple flow architecture diagram of the above process:

Release strategy

After going through the CI process above, code that has entered the trunk branch is theoretically ready to be deployed. In practice, however, the pace of release is controlled for quality assurance and release costs.

We use an Agile Release Train (ART) model based on a fixed release cycle, with one official release per week (Hotfixes aside), usually deployed on Tuesday night.
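With a fixed weekly cadence, the next departure of the "train" is a simple date calculation. The helper below is a small illustration; the function name is made up, and it treats a Tuesday as its own release day.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() counts Monday as 0


def next_release_day(today: date, weekday: int = TUESDAY) -> date:
    """Return the next scheduled release day on or after `today`."""
    return today + timedelta(days=(weekday - today.weekday()) % 7)


if __name__ == "__main__":
    print(next_release_day(date.today()))
```

A change that lands on the trunk on a Wednesday therefore waits six days for the next official release, unless it ships as a Hotfix.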

The quality requirements

As mentioned above, GrowingIO is developed by multiple Feature teams in parallel. Although the code for each Feature is tested before entering the trunk branch, multiple teams may change the same repository during a release cycle, and interactions between multiple Diffs may introduce new defects.

In addition, Diff testing usually focuses on the parts likely to be affected by the Diff itself. Although developers and testers try their best to analyze the scope of impact, individual knowledge is limited and the system is complex, so a Diff may cause defects in unexpected places.

Therefore, a comprehensive regression test of the functionality to be released must be carried out before each release (we have been burned here more than once ￣□￣||). This regression process takes the entire QA Team anywhere from half a day to a full day, and defects found during regression testing must be fixed immediately.

In addition, there is a Code Freeze phase during regression testing: in all code repositories involved in the release, developers are prohibited from landing code on the trunk branch, and only QA has permission to Land code.

The main purpose is to prevent new code from introducing bugs into functionality that has already been regression tested, while still allowing Bugfixes for defects found during regression testing to enter the trunk branch. If a defect found in regression testing cannot be fixed quickly, the Diff that introduced the problem must be located and then reverted from the release branch.
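The Code Freeze rule is a permission check keyed on the repository and the role of the person landing. The sketch below is a hypothetical model; the role names and repository names are invented, and in practice the restriction would be enforced by the code hosting tool rather than by a script like this.

```python
from typing import Set


def may_land(role: str, repo: str, frozen_repos: Set[str]) -> bool:
    """Return True if a user with `role` may Land code on the trunk of `repo`."""
    if repo not in frozen_repos:
        return True       # no freeze in effect for this repository
    return role == "qa"   # during a freeze only QA may land (regression Bugfixes)
```

For example, with `frozen_repos = {"api-server"}`, a developer can still land to an unrelated repository while `api-server` is frozen for regression testing.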

Tips: Why not delay the release after regression testing found defects?

Because there is other cargo on the “train” (new Feature or Bugfix) that needs to be delivered to the customer on time.
A Code Freeze that lasts too long disrupts normal iteration: code cannot be merged into the trunk, and development that depends on it is blocked.
Lifting the Code Freeze and delaying the release would mean running the regression tests all over again, which wastes manpower and money.

Cost considerations

As mentioned above, a full regression test is required for each official release, so frequent releases inevitably consume a large amount of manpower. Moreover, if QA is busy with regression testing, who will test the Diffs?

The solution is to improve the efficiency of regression testing. One way is to increase automated test coverage of the regression suite; the other is to adopt precision testing strategies that cut unnecessary regression test runs. Both take resources and time to build up and are hard to achieve in the short term, so choosing a reasonable release cadence is the simplest and most effective approach.

Feature toggles

Sometimes, a new feature that is already in the trunk is something that users don’t want to see, either for marketing reasons, or because the feature is still iterating and not ready enough. This can be controlled using a Feature Toggle.

GrowingIO’s feature toggles can currently be controlled by customer organization ID, customer project ID, user email address, user email suffix and custom script rules, which allows fine-grained control over features. These settings can be hot-updated by modifying the configuration file.
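A minimal sketch of how such a rule might be evaluated is shown below. The class names, field names and the "any match enables the feature" semantics are assumptions made for illustration; custom script rules and hot reloading of the configuration file are left out.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ToggleRule:
    org_ids: Set[str] = field(default_factory=set)
    project_ids: Set[str] = field(default_factory=set)
    emails: Set[str] = field(default_factory=set)
    email_suffixes: Set[str] = field(default_factory=set)


@dataclass
class User:
    org_id: str
    project_id: str
    email: str


def is_enabled(rule: ToggleRule, user: User) -> bool:
    """A feature is on for a user if any criterion of its rule matches."""
    return (
        user.org_id in rule.org_ids
        or user.project_id in rule.project_ids
        or user.email in rule.emails
        or any(user.email.endswith(suffix) for suffix in rule.email_suffixes)
    )
```

An empty rule matches nobody, so a feature that has just been merged into the trunk stays hidden until someone is explicitly added to its rule.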

Deployment strategy

Pre-deployment preparation

After the regression test is passed, the QA team will prepare for the release, including the following aspects:

1. Generate JIRA Release Notes and check whether all tickets associated with the Release Version in JIRA are completed.

If a ticket is incomplete only because its status was not updated in time, update the status.
If the work itself is incomplete, move the ticket to a later version.
If a Diff was removed because of a problem found in regression testing, move its ticket to a later version.

2. Generate project Release Notes

The steps for going live and which branch of each repository to release, which is usually Master. Any special instructions, such as configuration changes or SQL scripts to execute, must be specified.
A detailed list of changes: which new Diffs are included in the release and their corresponding JIRA tickets. The main purpose is to make it easy to locate suspicious Diffs and troubleshoot quickly if problems appear after the release.
Every release must have a release list, even a simple Hotfix.
The release list has a specific version number that corresponds to the Release Version in JIRA.
After the Release Notes are generated, they are sent to the R&D teams for review.

Tips: The project Release Notes are mainly generated automatically by scripts that check all code repositories to be released; simple manual adjustments are then made if necessary.
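One way such a script could build the change list is by parsing commit messages. The sketch below assumes each landed commit message carries the "Differential Revision:" line that Phabricator appends, and that JIRA ticket keys such as `GIO-12` appear somewhere in the message; the ticket key, the URL and the function name are made-up examples, not GrowingIO's actual script.

```python
import re
from typing import Dict, Iterable, List

TICKET = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")
REVISION = re.compile(r"^Differential Revision:\s*\S*?(D\d+)\s*$", re.MULTILINE)


def release_entries(commit_messages: Iterable[str]) -> List[Dict[str, object]]:
    """Turn commit messages into release-list entries (title, Diff ID, tickets)."""
    entries = []
    for message in commit_messages:
        lines = message.strip().splitlines()
        revision = REVISION.search(message)
        entries.append({
            "title": lines[0] if lines else "",
            "diff": revision.group(1) if revision else None,
            "tickets": sorted(set(TICKET.findall(message))),
        })
    return entries


if __name__ == "__main__":
    sample = ("Fix login redirect GIO-12\n\nSummary: handle expired sessions\n\n"
              "Differential Revision: https://phab.example.com/D345")
    print(release_entries([sample]))
```

The commit messages themselves could come from `git log` over the range between the last released commit and the current head of Master.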

Ready for release

When the release list is ready, the QA team submits a release request to the SRE team through a DingTalk group. After receiving the request, the SRE team checks that the steps in the release list are clear and unambiguous and familiarizes itself with the release content. Once confirmed, the code to be released is compiled by Jenkins ahead of time in preparation for deployment.

The deployment process

“Dongfeng Express, delivery guaranteed” 🚀: once the scheduled release time arrives, SRE starts the deployment. To ensure that service is not interrupted during deployment, GrowingIO currently uses a rolling blue-green deployment on the server side.

We have two similar server clusters online. The microservices in the two clusters are registered in separate groups, and their traffic is isolated from each other. During deployment, the PROD0 cluster is taken offline first, so all customer traffic goes to the remaining cluster, PROD1. SRE updates the services in PROD0 according to the release list and notifies QA to verify the PROD0 cluster once the update is complete.

Once QA validation passes, the SRE team is notified to continue the release. The SRE team then brings the PROD0 cluster online, takes the PROD1 cluster offline and updates the PROD1 cluster.

After PROD1 is updated, QA is notified again to verify the PROD1 cluster. Once QA validation passes, the SRE team is notified to continue. SRE brings the PROD1 cluster online and announces that the deployment is complete.

Note that data-side applications follow a rolling release strategy, so they are released together with the PROD0 update.

Of course, despite all the preparation and testing, accidents can happen, but they’re rare. If QA finds a serious problem during verification, or if SRE detects an abnormality through monitoring alerts, the release is aborted and the services are rolled back.
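The sequence of offline, update, verify and online steps can be sketched as an orchestration loop. The callables `deploy`, `verify`, `set_online` and `rollback` are placeholders for the real Jenkins and Capistrano tasks and for QA sign-off, and the rollback ordering is a simplification that the text does not spell out.

```python
from typing import Callable, List, Sequence


def blue_green_release(clusters: Sequence[str],
                       deploy: Callable[[str], None],
                       verify: Callable[[str], bool],
                       set_online: Callable[[str, bool], None],
                       rollback: Callable[[str], None]) -> bool:
    """Update clusters one at a time while the others keep serving traffic.

    Returns True when every cluster was updated and verified.
    """
    updated: List[str] = []
    for cluster in clusters:
        set_online(cluster, False)   # remaining clusters take all customer traffic
        deploy(cluster)
        updated.append(cluster)
        if not verify(cluster):      # QA validation on the offline cluster
            for done in reversed(updated):
                rollback(done)       # simplified: roll back everything updated so far
            set_online(cluster, True)
            return False
        set_online(cluster, True)
    return True
```

With `clusters=["PROD0", "PROD1"]`, a failed verification on PROD0 aborts the release before PROD1 is touched, so customers never see the faulty version.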

Jenkins executes Capistrano scripts to complete the deployment and rollback operations. No human intervention is required to complete the deployment details. In most cases, the deployment process takes less than 30 minutes.

Deployment wrap-up

After a successful deployment, the QA team sends a Release email to the whole company with the details of the release, namely the “JIRA Release Notes” and the “Engineering Release Notes” (the project Release Notes) prepared before deployment.

The former is read by business people and the latter by engineers, so people across the company can keep track of the new features they’re interested in and whether a Bugfix is live.

Problems and deficiencies

The above CI/CD process still has many shortcomings. The most important ones are the following.

1. Source-based releases

In the current GrowingIO SaaS release process, every deployment, whether to the DEV, QA, Staging or PROD environment, recompiles the code from the code branch before deploying. The problems with building from source for every release are:

It wastes time: the code is recompiled for every release, and some services take as long as 10 minutes to compile.
Because the base environment may differ between compilations, and remote dependency versions may change, the compiled output may differ from what was tested, which can cause problems after a tested change goes live.

This problem is relatively easy to solve. For example, the CI process for GrowingIO’s private deployment product line already delivers binary packages, which enables a single build to be deployed to multiple environments, provided that code and configuration are separated. The main reason the SaaS process has not been reworked yet is that we plan to fix this when the production environment moves to a Kubernetes-based container cloud.
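The "one build, multiple environment deployments" idea can be sketched as follows: the artifact is built and fingerprinted once, and each environment only contributes its own configuration. The names, the digest-based identity and the configuration keys are illustrative assumptions, not the actual private deployment pipeline.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Artifact:
    name: str
    version: str
    sha256: str  # fingerprint of the compiled package


def build_once(name: str, version: str, payload: bytes) -> Artifact:
    """Record the single build whose output is promoted through all environments."""
    return Artifact(name, version, hashlib.sha256(payload).hexdigest())


def deployment_plan(artifact: Artifact,
                    env_configs: Dict[str, Dict[str, str]]) -> List[Dict[str, object]]:
    """Pair the same artifact with each environment's own configuration."""
    return [
        {"env": env, "artifact": artifact.sha256, "config": config}
        for env, config in env_configs.items()
    ]
```

Since every entry in the plan carries the same digest, what QA tested is byte-for-byte what reaches production; only the configuration differs.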

2. Insufficient automated test coverage

An efficient CI/CD process cannot be separated from automated testing. Currently GrowingIO’s unit testing, API automated testing, and UI automated testing coverage across the entire SaaS product is not complete enough, resulting in excessive reliance on manual inspection and low efficiency. This depends on the company’s continued investment in quality assurance to improve gradually. There are no shortcuts.

3. The CI/CD process is split into several segments that are connected manually through email, DingTalk and Jenkins

At present, the whole CI/CD process is divided into the Feature CI process, the Diff CI process and the deployment process. The three are connected through email and DingTalk notifications and by chaining different Jenkins tasks together. As the company grows, such a collaboration process becomes increasingly chaotic and inefficient, and data collection and measurement become difficult.

The effective method to solve this problem is to develop a tool platform to integrate the process and tools, to provide a unified management entrance, process specification, and to realize the automatic data collection and analysis.

4. The deployment process is too complex to allow for lighter rolling releases and Canary deployment

The rolling blue-green deployment described above ensures that user service is not interrupted during a release. However, with this approach all microservices in half of the cluster are taken offline, so the maximum service capacity available online is halved. This is clearly risky during peak usage and limits the choice of release window.

In addition, it is not reasonable to take the entire half of the cluster offline every time you upgrade a single microservice. The current plan is to use a Kubernetes-based container cloud deployment approach in production to address this issue.
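A quick calculation shows why halving capacity constrains the release window. The numbers in the usage note are invented for illustration, and the function name is hypothetical.

```python
def utilization_during_release(load: float, cluster_capacity: float,
                               clusters_total: int = 2,
                               clusters_offline: int = 1) -> float:
    """Fraction of the remaining online capacity consumed by the current load."""
    online_capacity = cluster_capacity * (clusters_total - clusters_offline)
    return load / online_capacity
```

For instance, if each cluster can serve 1000 requests per second and peak traffic is 1200 requests per second, `utilization_during_release(1200, 1000)` gives 1.2, meaning the single remaining cluster would be overloaded. The same load spread over both clusters uses only 60% of total capacity, which is why releases have to be scheduled outside peak hours.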

Conclusion

This article introduces the concept of CI/CD, the various tools that GrowingIO SaaS products use internally, and the specific practices on CI/CD. As mentioned above, we are still a long way from a mature CI/CD. There are even places where best practices are not followed, such as source-based distribution.

But these practices were built up step by step based on the company’s actual situation, and I hope they offer some inspiration. This article describes the overall CI/CD flow at a high level; future articles will describe specific tool configurations, as well as GrowingIO’s CI/CD improvements for privately deployed products.

About GrowingIO

GrowingIO is the leading one-stop digital growth solution provider in China. It provides a customer data platform (CDP), advertising analytics, product analytics, intelligent operations and other products and consulting services to product, operations, marketing and data teams and their managers, helping companies improve their data-driven capabilities and achieve better growth on the road to digital transformation.
