GrowingIO SaaS CI/CD Practice (Part 1)

The author is QA Lead at GrowingIO and previously worked at HP and Qihoo 360 in China. The author leads the QA team responsible for quality assurance across the GrowingIO product line and currently focuses on DevOps practice, helping the team improve quality and efficiency.

Purpose

This article describes some of GrowingIO’s past CI/CD practices for its online SaaS product. For historical reasons, some of the tools the company uses are relatively niche, and the current CI/CD process still has plenty of room for improvement, but the practical experience should still be a useful reference.

What is CI/CD

CI/CD is a way to frequently deliver applications to customers by introducing automation during the application development phase. The core concepts of CI/CD are Continuous Integration, Continuous Delivery and Continuous Deployment. The industry understands CI/CD as follows.

CI Continuous Integration

Continuous integration is a development practice in which developers frequently commit code to the trunk, and each change is verified by compilation and automated tests before it is finally merged.

Continuous integration is the process of automatically detecting, pulling, building, and (in most cases) conducting unit testing and static quality analysis after source code changes.

The goal of continuous integration is to quickly confirm that newly committed changes are sound and suitable for further use in the code base. Running the CI process tells us whether the new code integrates correctly with the existing code.

CD Continuous Delivery

Continuous delivery automatically publishes validated code to the repository after completing the automated process of building and unit and integration tests in CI. To achieve an efficient continuous delivery process, it is important to ensure that CI is built into the development pipeline. The goal of continuous delivery is to have a code base ready to deploy to production.

In continuous delivery, each phase, from the consolidation of code changes to the delivery of production-ready builds, involves test automation and code release automation. At the end of the process, the operations team can quickly and easily deploy the application into production or release it to end-users.

CD Continuous Deployment

The final stage for a mature CI/CD Pipeline is continuous deployment, which automatically releases the application to production.

Continuous deployment means that every change that passes the pipeline is automatically deployed to production, although a release can still be held back for business reasons. Continuous deployment can only be implemented on top of continuous delivery.

Continuous delivery does not mean that every software change should be deployed to production as soon as possible. It means that any code change can be deployed at any time. Continuous delivery represents a capability, and continuous deployment represents a means. Continuous deployment is the highest stage of continuous delivery.

How to implement CI/CD

There are several ways to implement a CI/CD process:

1. Buy ready-made products or services

There are many mature products that offer full DevOps functionality, such as Atlassian, Microsoft Azure DevOps, Alibaba Cloud Yunxiao, Coding.Net, etc. If you are a new startup, you can simply buy suitable products and services and quickly build a complete DevOps system by leveraging the tools and best practices they provide. Many DevOps services are integrated with cloud vendors, so if you already use that vendor’s cloud, adding its DevOps service is not expensive.

2. Integrate existing tools

This is currently the most common practice: integrate the company’s existing project management, code management, artifact management and other tools, with some simple secondary development where needed, to form a complete CI/CD process. The overall cost is manageable, the approach adapts to the existing infrastructure, and it can be customized to the company’s own processes.

Most companies adopt this approach because, when a company is founded, survival comes before growth. Engineering efficiency is not the first priority, and separate tools (such as code management) are adopted to solve problems in individual areas. As the company grows and becomes more concerned with engineering efficiency, integrating the existing tools is the natural choice, and GrowingIO is no exception.

3. Use an open source DevOps platform

There are also open source products that integrate various open source tools through secondary development into a complete DevOps platform, such as Choerodon. If you don’t want to buy a cloud service and don’t want to do secondary development yourself, you can try such tools.

Note, however, that open source tools do not necessarily require fewer resources. Troubleshooting problems, or adapting the tool when it cannot meet your needs, requires in-depth study and understanding of the tool. When choosing such a product, you also need to consider whether it matches the company’s current technology stack and whether the team is able to maintain it.

4. Design and develop your own implementation

Build your own wheel: design and develop an implementation from scratch. This method obviously has the highest investment cost, but it is also the most flexible and best meets the needs of the enterprise. Generally only very large enterprises spend R&D resources on this. For example, some of the cloud service products mentioned in the first option were initially developed and used in-house, and were offered to the outside world as products once mature, to generate additional economic returns.

Tools used

If you want to do a good job, you must first sharpen your tools. To achieve an economical and efficient CI/CD process, choosing the right tools yields twice the result with half the effort. A typical CI/CD process requires tools with at least the following capabilities.

A code repository: version control software that keeps the code maintainable and serves as the source for the build process
A continuous integration server, used to automate build, test, deployment, and more
A centralized artifact repository that stores build artifacts for deployment
An automated deployment tool, used to deploy the built applications to the target servers automatically

Code management Phabricator

Phabricator is a powerful software development collaboration tool similar to GitHub, GitLab, etc. It supports Git, Mercurial and SVN repository hosting, code review, command line tools, task management, kanban boards, a wiki, automation rules, webhooks, an API and many other functions.

Continuous integration server Jenkins

Many automation tools can implement continuous integration, and Jenkins is the most popular open source one. Jenkins supports multiple job types, has 1000+ plugins, and has an active community and ecosystem. Jenkins 2.0 introduced the Pipeline as Code feature, which brings the definition of CI/CD pipelines under version control and supports GitOps. In addition, Jenkins supports a master-slave architecture that can be implemented in various ways. Among them, the mode built on Kubernetes’ powerful orchestration and scheduling capabilities can create and destroy slave nodes dynamically, which greatly improves CI execution efficiency and resource utilization.

Quality management platform SonarQube

SonarQube is a code quality management platform that detects potential bugs, security vulnerabilities, code specifications, duplicate code, lack of unit testing, and other code quality issues in a project, and provides a UI to view them. You can maintain and improve code quality by controlling thresholds for relevant code quality indicators. With SonarQube’s free community edition, you can meet most project needs, and it’s easy to integrate with third-party CI/CD and Code Review tools.

Artifact repository management Nexus

Nexus is an artifact repository manager developed by Sonatype that supports Maven, npm, PyPI, Docker, Helm and dozens of other artifact formats. It also provides webhooks and a REST API, so it can be easily integrated with third-party tools. The free open source version of Nexus (Nexus Repository OSS) already meets most of our needs.

Deployment tool Capistrano

Capistrano is a free, open source remote server automation tool written in Ruby. Capistrano works over SSH without an agent: it only needs to be installed on one client machine to manage multiple servers. Capistrano provides a DSL and workflow for deployment and rollback, making it easy to automate remote deployment and rollback of services. It is also easy to write custom plugins or scripts that extend its functionality to meet specific requirements.

In addition to the tools above, we also use Kubernetes, Docker and others. There are many alternatives for each of these capabilities, and the choice depends on the company’s technology stack, current tool usage, engineering practices, deployment mode, and many other factors.

Source code Management Strategy

Source code management, as the starting point of CI/CD, is deeply integrated with the CI/CD pipeline. Different branching strategies affect the design and implementation of the pipeline. Therefore, the branch management strategy must be designed according to the company’s current collaboration process before the CI/CD pipeline is designed.

There are many Git branch management schemes, each with its own advantages, disadvantages and applicable scenarios. Well-known ones include Git-flow, trunk-based development and GitHub Flow. Some enterprises also customize their own branching strategy, such as Alibaba’s AoneFlow.

GrowingIO tends to use a trunk-based branching strategy, ideally “Trunk development, Trunk release,” which requires high quality code development.

At present, the company’s development practice cannot meet such requirements, so we settle for the next best thing: “branch development, trunk release”. When branch life cycles are very short, merge conflicts are rare and this is basically equivalent to trunk development. A branch release strategy is also used temporarily when a bug in the trunk branch blocks a release.

Branch instructions

The Master branch

The Master branch is the latest code integration trunk branch, and it is also the Release branch. Generally, the code can only enter the Master branch after it has been fully tested. The Master branch is ready for Release.

Direct git push to this branch is forbidden. Changes must be submitted as a Diff (the unit of Code Review in Phabricator, equivalent to a GitHub PR or a GitLab MR) and can only be merged after passing Code Review and test acceptance.

The Release branch

The Release branch is a temporary branch, which is used in cases where the Master branch does not meet the requirements to go online but needs to go online urgently.

For example, if a change introduces a serious Bug during the launch process, it is common to remove the problematic Commit and create a temporary branch to continue the release so as not to affect the normal release of other changes.

Feature branch

The Feature branch, which also serves as the integration test branch, is the most active branch in the development process, and an integration build runs every time code is submitted to it. In the microservice development mode, it is difficult for developers to build a complete integration environment locally, so a cloud integration test environment is needed to help developers complete joint debugging.

If code passes integration testing, QA acceptance, and product acceptance before it goes into the trunk, the risk of merging it into the trunk is greatly reduced. Of course, the life cycle of Feature branches should be short enough to avoid merge conflicts.

Trunk-based development opposes long-lived Feature branches and encourages techniques such as feature toggles and branch by abstraction, so that code can be merged into the trunk as early as possible without affecting trunk functionality.

The Local branch

The developer’s local branch. It can stay local because Phabricator allows you to submit a Diff for Code Review and deployment without creating a remote branch.

After local debugging and unit testing, the developer submits a Diff from the local development branch against the Master branch or a Feature branch for Code Review. Once the review passes, the code enters the corresponding base branch.

Feature Development process

Generally, implementing a relatively large Feature requires collaborative development across multiple data-side components, multiple server-side microservices and the front end. During Sprint planning, the EM (Engineering Manager) splits the Feature ticket into multiple sub-tasks and assigns them to different developers.

The developers of each module create a Feature branch from Master and push it to the remote, as shown by the Feature 1 branch in the figure below
The developer of a module creates a local development branch from the Feature branch to implement a sub-task. Multiple developers working in the same code repository at the same time each create their own local branch, such as the Dev1 and Dev2 branches in the figure below
After developing and debugging locally and running the unit tests, the developer submits a Diff for Code Review (CR). If the CR does not pass, the developer modifies the code and updates the Diff until the CR passes, then merges it into the Feature 1 branch and deletes the local development branch, as shown by the Dev1 branch. When another developer finishes local development and needs to submit a Diff, they should first rebase onto the Feature branch, then submit the Diff, merge into the Feature 1 branch after the CR passes, and delete the local branch, as shown by the Dev2 branch. To continue implementing new functionality or fixing bugs on the Feature 1 branch, create a new local branch
The code on the Feature branch is deployed to the corresponding integration test environment so that developers from all sides can run joint tests
After the Feature branch passes joint debugging, a new Diff is submitted against the Master branch and handed to QA for testing. Once QA testing passes, the Diff is merged into the Master branch and the corresponding Feature branch is deleted
If only one developer is working on a Feature branch in a repository, they can push to it directly, as with Feature 2, then submit a Diff for Code Review once the code passes integration testing and merge into the trunk branch when the CR passes. Even with a single developer, however, it is recommended to get code into the Feature branch by submitting Diffs, so that the final Diff does not become too large to review effectively

** Before each Diff submission, rebase onto or merge the remote base branch to prevent merge conflicts.

The above process seems complicated, but it can be easily implemented with the help of Phabricator’s command line tool Arcanist.

The problem here is that the Feature branch diverges from the Master branch. As shown in the figure above, new changes made on the Master branch after Feature 1 was created do not enter Feature 1. To handle this, the CI process automatically merges the Master branch into Feature 1 in the build workspace (the merge is not pushed to the remote branch). If the merge fails, the CI process fails and the developer manually updates the Feature 1 branch to synchronize it with Master.
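
The trial merge described above can be sketched as follows. This is a minimal illustration rather than GrowingIO's actual Jenkins job: the function names, the `run` hook and the branch names are assumptions made for the example, and only standard git subcommands are used.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_git(args: Sequence[str]) -> int:
    """Run a git command in the current directory and return its exit code."""
    return subprocess.run(["git", *args]).returncode


def trial_merge(feature: str, base: str = "master", run: Runner = run_git) -> bool:
    """Merge `base` into `feature` inside the CI workspace only; nothing is pushed."""
    if run(["fetch", "origin", base, feature]) != 0:
        return False
    # Work on a detached HEAD so no local or remote branch is modified.
    if run(["checkout", "--detach", f"origin/{feature}"]) != 0:
        return False
    # A conflict makes git exit with a non-zero code.
    if run(["merge", "--no-edit", f"origin/{base}"]) != 0:
        run(["merge", "--abort"])  # leave the workspace clean
        return False  # the CI job should fail here
    return True  # later CI stages build the merged result
```

When the function returns False, the developer updates the Feature branch by hand, as described above. The `run` parameter exists only so that the logic can be exercised without a real repository.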

Hotfix development process

The Hotfix development process is similar to the Feature process. The developer creates a local development branch directly from the Master branch and, once development is done, submits a Diff that goes through static code checks, unit tests, Code Review and QA testing.

In some cases an urgent Hotfix needs to be released after the Master branch has already incorporated new features, and a Release branch is needed. You can create a Release branch from the last released commit, cherry-pick the hotfix from Master onto the Release branch, and release from that branch.

Keep the Release branch until the next regular release from the Master branch, in case another Hotfix needs to be released, and then delete it. This situation is rare, because our SaaS services are released frequently.
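
As a rough sketch of the Release-branch hotfix flow above, the following helper lists the git commands involved. The branch name `release-hotfix` and both function names are hypothetical; the commit identifiers are supplied by the caller.

```python
from typing import List


def hotfix_release_commands(last_release_commit: str, hotfix_commit: str,
                            branch: str = "release-hotfix") -> List[List[str]]:
    """Git commands that build a temporary Release branch carrying only the hotfix."""
    return [
        # Start from the commit that was last released, not from Master's tip.
        ["git", "checkout", "-b", branch, last_release_commit],
        # Copy the hotfix that already landed on Master; -x records the source commit.
        ["git", "cherry-pick", "-x", hotfix_commit],
        ["git", "push", "origin", branch],
    ]


def cleanup_commands(branch: str = "release-hotfix") -> List[List[str]]:
    """Commands that delete the temporary branch after the next release from Master."""
    return [
        ["git", "push", "origin", "--delete", branch],
        ["git", "branch", "-D", branch],
    ]
```

Returning the commands as data keeps the sketch free of side effects; a CI job or a developer would run them in order.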

The CI process

CI process based on Feature branch

At GrowingIO there are always multiple Feature teams developing different functions in parallel, and each Feature team has its own independent joint debugging environment (corresponding to its Feature branch).

Developers push code to the Feature branch. Jenkins automatically detects changes on that branch of the corresponding code repository and starts the corresponding job, which runs static checks and unit tests, compiles and packages the code, and deploys it to the joint debugging environment.

For front-end and server applications, Docker images are packaged and deployed in the K8S environment of the corresponding branch; for data applications, ZIP packages are packaged and deployed in the VM-based environment.
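
The feature-branch flow described above can be sketched as a sequence of stages that stops at the first failure. This is an illustrative model, not the actual Jenkins configuration: the stage names, `package_format` and `run_pipeline` are assumptions for the example.

```python
from typing import Callable, Dict, List

# Stage order taken from the description above; joint testing by developers follows.
STAGES = ["code_check", "unit_test", "build", "deploy"]


def package_format(app_type: str) -> str:
    """Data applications ship as ZIP packages; front-end and server apps as Docker images."""
    return "zip" if app_type == "data" else "docker-image"


def run_pipeline(steps: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run the stages in order and return the ones that completed successfully."""
    completed = []
    for stage in STAGES:
        if not steps[stage]():
            break  # a failed stage stops the pipeline
        completed.append(stage)
    return completed
```

A run whose result is shorter than `STAGES` tells the developer which stage needs attention before joint testing can start.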

After all the code for a Feature has been submitted, the developers run joint tests in the development environment. The key process: Git Push → Code Check → Unit Test → Build → Deploy → Integrated Testing. Once the tests pass, submitting a Diff triggers the Diff-based process described next.

Diff based CI process

After a Diff is submitted, automation rules defined in Phabricator call Jenkins’ webhook to trigger the corresponding Jenkins job, which performs a trial merge with the trunk branch (nothing is pushed to the remote branch; this only checks for conflicts), static code checks and unit tests. It also runs a Sonar scan, and Jenkins’ Phabricator plugin automatically adds unit test coverage and other information to the comments of the corresponding Diff.

If any of the automatic checks above fails or unit test coverage is below the required standard, the developer must make changes until the checks pass (although this restriction is sometimes relaxed when an urgent bugfix needs to be released quickly).
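
A minimal sketch of such an automatic gate is shown below. The 0.6 coverage threshold is a made-up value, and the assumption that only the coverage requirement is relaxed for urgent bugfixes is ours; the article does not say which checks are relaxed.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    merge_clean: bool          # trial merge with the trunk had no conflicts
    static_check_passed: bool
    unit_tests_passed: bool
    coverage: float            # unit test coverage as a fraction, e.g. 0.72


def auto_check_passes(result: CheckResult, min_coverage: float = 0.6,
                      urgent_bugfix: bool = False) -> bool:
    """Return True when the Diff may move on to manual code review."""
    hard_checks = (result.merge_clean
                   and result.static_check_passed
                   and result.unit_tests_passed)
    if not hard_checks:
        return False
    # Assumption: only the coverage threshold is waived for urgent bugfixes.
    return urgent_bugfix or result.coverage >= min_coverage
```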

When the automatic checks pass, other developers in the group are notified to conduct a manual code review. If the review passes, the Diff is marked as Accepted in Phabricator. Automation rules defined in Phabricator then add the QA Team as a Blocking Reviewer of the Diff (until the Blocking Reviewer accepts, the Diff cannot be merged into the trunk branch) and notify QA staff by email to start testing.

QA staff deploy the Diff to the corresponding QA environment for testing by manually running a Jenkins job as needed. After the tests pass, QA marks the Diff as Accepted in Phabricator and notifies the developer to Land the code (Land is a Phabricator term for merging the code into the trunk branch and closing the Diff, equivalent to merging a GitHub PR or a GitLab MR). QA can of course also Land the code directly. The key process: Create Diff → Auto Check → Peer Review → QA Review → Land.
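
The gating in this flow can be modelled with a small state object. This is only a sketch of the rules described above, not Phabricator's data model; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Diff:
    auto_check_passed: bool = False
    peer_accepted: bool = False
    qa_accepted: bool = False  # the QA Team acts as a blocking reviewer

    def next_step(self) -> str:
        """Return the next stage the Diff has to pass, in the order of the flow."""
        if not self.auto_check_passed:
            return "auto_check"
        if not self.peer_accepted:
            return "peer_review"
        if not self.qa_accepted:
            return "qa_review"
        return "land"

    def can_land(self) -> bool:
        """A Diff may land only after every earlier stage has accepted it."""
        return self.next_step() == "land"
```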

Here is a simple process architecture diagram for the above process:

Release strategy

After going through the CI process above, code that is already in the trunk branch is theoretically ready for deployment. In practice, however, the pace of release is controlled for quality assurance and release costs.

We use an Agile Release Train (ART) model with a fixed release cycle: usually one official release per week (as opposed to Hotfix releases), typically deployed on Tuesday night.
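
For illustration, the date of the next train on a weekly Tuesday schedule can be computed as below. The function name is hypothetical, and treating a Tuesday as its own release day is an assumption.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() counts Monday as 0


def next_release_day(today: date) -> date:
    """Return the next Tuesday on or after `today`."""
    days_ahead = (TUESDAY - today.weekday()) % 7
    return today + timedelta(days=days_ahead)
```

A change that lands on a Wednesday therefore waits six days for the next train unless it ships as a Hotfix.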

Quality requirements

As mentioned above, GrowingIO is developed by multiple Feature teams in parallel. Although the code submitted by each Feature is tested before entering the trunk branch, multiple teams may modify the same code repository in a release cycle. Multiple Diff interactions can introduce new defects.

In addition, Diff testing generally focuses on the parts likely to be affected by the change. Although developers and testers try to analyze the scope of impact, individual knowledge is limited and the system is complex, so a Diff may cause defects in unexpected places.

Therefore, it is important to run a comprehensive regression test on the features to be released before each release (we have been burned here more than once). This regression can take the entire QA Team anywhere from half a day to a full day, and defects found during regression testing must be fixed immediately.

In addition, there is a Code Freeze during regression testing. For all code repositories involved in the release, developers are not allowed to Land code to the trunk branch; only QA has permission to Land code.

The main purpose is to prevent new code from introducing bugs into functionality that has already been regression tested, while still allowing bugfixes for regression defects to enter the trunk branch. If a defect found during regression testing cannot be fixed quickly, locate the Diff that introduced the problem and Revert it so that it is removed from the release.

** Why not delay the release when regression testing finds defects?

Because there are other goods on the train (new Features or Bugfixes) that need to be delivered to customers on time.
A long Code Freeze affects normal iteration: code cannot be merged into the trunk, and development work that depends on it is blocked.
Lifting the Code Freeze would mean redoing the regression test, which wastes manpower and resources.

Cost considerations

As mentioned above, full regression testing is required for every official release, and frequent releases will inevitably result in a lot of labor costs. Furthermore, if QA does regression testing, who will do Diff testing?

The solution to this problem is to improve the efficiency of regression testing. One is to increase the coverage of automated regression tests, and the other is to adopt some precise testing strategies to reduce unnecessary regression test execution. However, all of these require some resource investment to build slowly and are difficult to achieve in a short period of time, so choosing a reasonable release pace is the simplest and most effective way.
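
One simple form of precise testing is to select regression suites from the modules a release touches. The sketch below assumes a hand-maintained mapping; the module names, suite names and fallback rule are invented for illustration and are not GrowingIO's actual setup.

```python
from typing import Dict, Iterable, Set

# Hypothetical mapping from code modules to the regression suites that cover them.
SUITES_BY_MODULE: Dict[str, Set[str]] = {
    "dashboard": {"ui_dashboard", "api_charts"},
    "funnel": {"api_funnel", "ui_funnel"},
    "auth": {"api_login", "ui_login"},
}
ALWAYS_RUN: Set[str] = {"smoke"}


def select_suites(changed_modules: Iterable[str]) -> Set[str]:
    """Pick the suites to run for a release that touches `changed_modules`."""
    selected = set(ALWAYS_RUN)
    for module in changed_modules:
        if module not in SUITES_BY_MODULE:
            # Unknown impact: fall back to the full regression run.
            return set(ALWAYS_RUN).union(*SUITES_BY_MODULE.values())
        selected |= SUITES_BY_MODULE[module]
    return selected
```

Falling back to the full run for unmapped modules keeps the selection safe while the mapping is still incomplete.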

Feature toggles

Sometimes a new feature that is already in the trunk branch should not yet be visible to users, either for marketing reasons or because the feature is still being iterated on. This can be controlled with a Feature Toggle.

GrowingIO has implemented its own Feature Toggle, which supports fine-grained control by customer organization ID, customer project ID, user email, user email suffix, and custom script rules. These settings can be updated by modifying configuration files.
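
A toggle rule of this kind might be evaluated as in the sketch below. It is not GrowingIO's implementation: the class and field names are hypothetical, custom script rules are left out, and matching any single criterion is assumed to be enough to enable the feature.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ToggleRule:
    org_ids: Set[str] = field(default_factory=set)
    project_ids: Set[str] = field(default_factory=set)
    emails: Set[str] = field(default_factory=set)          # stored in lower case
    email_suffixes: Set[str] = field(default_factory=set)  # e.g. "@example.com"

    def enabled_for(self, org_id: str, project_id: str, email: str) -> bool:
        """Enable the feature if any one of the configured criteria matches."""
        email = email.lower()
        return (
            org_id in self.org_ids
            or project_id in self.project_ids
            or email in self.emails
            or any(email.endswith(suffix) for suffix in self.email_suffixes)
        )
```

In practice the rule would be loaded from the configuration file mentioned above, so a feature can be opened to one organization first and widened later without a new deployment.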

Deployment strategy

Preparing for deployment

After the regression test is passed, the QA team does pre-release work, which includes the following:

1. Generate Jira Release Notes and check whether all Tickets associated with the Release Version in Jira are complete.

If a Ticket is shown as incomplete only because its status was not updated in time, update the status.
If a Ticket is genuinely incomplete, move it to a later version.
If a Diff was removed because it was found to be problematic during regression testing, move the corresponding Ticket to a later version.

2. Generate project Release Notes

They contain the go-live steps and which branch of each repository to release, usually Master. Any special operations, such as modifying configuration or executing SQL scripts, must be described in detail.
They contain a detailed change list: which new Diffs are included in this release and their corresponding Jira tickets. This makes it easy to locate suspicious Diffs and troubleshoot from the change list if problems occur after launch.
Even a simple Hotfix must come with a release list every time it goes live.
Each release list has a specific version number that corresponds to the Release Version in Jira.
After the Release Notes are generated, they are posted to the R&D group for each team to review.

The Release Notes project is mainly a script that automatically checks all repositories to be released and generates the notes, which are then adjusted manually if necessary.
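
The change-list part of such a script could look roughly like the sketch below. The record fields, the output layout and the hard-coded branch name are assumptions made for illustration; the real script and its data sources are not described in this article.

```python
from typing import List, NamedTuple


class LandedDiff(NamedTuple):
    repo: str
    diff_id: str
    ticket: str
    title: str


def render_release_notes(version: str, diffs: List[LandedDiff]) -> str:
    """Group the landed Diffs by repository under a version heading."""
    lines = [f"Release {version}", ""]
    for repo in sorted({d.repo for d in diffs}):
        lines.append(f"{repo} (branch: master)")
        for d in diffs:
            if d.repo == repo:
                lines.append(f"  {d.diff_id} [{d.ticket}] {d.title}")
    return "\n".join(lines)
```

Special operations such as configuration changes or SQL scripts would still be added by hand, as noted above.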

Release readiness

After the release list is ready, the QA team submits a release request to the SRE team through the DingTalk group. On receiving the request, the SRE team checks that the steps in the release list are clear and familiarizes itself with the release content. Once confirmed, the code to be released is compiled in advance by Jenkins in preparation for deployment.

The deployment process

Dongfeng Express, delivery guaranteed 🚀. SRE begins the deployment at the scheduled release time. To ensure service continuity during deployment, GrowingIO uses a blue-green rolling deployment mode.

We have two similar server-side clusters online. The microservices in the two clusters are registered in separate groups, with traffic isolated from each other. During deployment, the Prod0 cluster is taken offline first and all customer traffic flows to the remaining cluster, Prod1. SRE updates the services on Prod0 according to the release list and, when the deployment is complete, notifies QA to verify on the Prod0 cluster.

After QA verification passes, QA notifies the SRE team to continue the release. The SRE team then brings the Prod0 cluster online, takes the Prod1 cluster offline, and updates the Prod1 cluster.

After the Prod1 cluster is updated, QA is notified to verify again on the Prod1 cluster. After verification passes, QA notifies the SRE team once more, and SRE brings the Prod1 cluster online and announces that the deployment is complete.

Note that data-side applications use a rolling release strategy and are released together with the Prod0 release.

Of course, despite all the preparation and testing, accidents can happen, but rarely. If QA detects a serious problem during the service rollout or the SRE detects an exception by monitoring alarms, the service rollout is terminated and related services are rolled back.

The entire deployment and rollback is performed by Jenkins using Capistrano scripts without manual intervention, and in most cases the entire deployment process can be completed in less than 30 minutes.
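
The two-cluster rollout described in this section can be sketched as the following control flow. This is a Python model for illustration, not the Capistrano scripts used in production: the callbacks stand in for the real update, QA verification and rollback steps, and the order in which clusters are rolled back is an assumption.

```python
from typing import Callable, List, Sequence


def blue_green_release(update: Callable[[str], None],
                       verify: Callable[[str], bool],
                       rollback: Callable[[str], None],
                       clusters: Sequence[str] = ("Prod0", "Prod1")) -> List[str]:
    """Update one cluster at a time while the other serves traffic; return an event log."""
    log: List[str] = []
    updated: List[str] = []
    for cluster in clusters:
        log.append(f"offline {cluster}")  # the other cluster takes all traffic
        update(cluster)
        updated.append(cluster)
        if not verify(cluster):
            # The failed cluster is still offline; earlier ones go offline again first.
            for done in reversed(updated):
                if done != cluster:
                    log.append(f"offline {done}")
                rollback(done)
                log.append(f"rollback {done}")
                log.append(f"online {done}")
            return log
        log.append(f"online {cluster}")
    return log
```

At no point in the log are both clusters offline at the same time, which is what keeps the service available during both release and rollback.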

After deployment

After the deployment is successful, the QA team sends a Release email to inform all members of the company of the details contained in the Release, namely “Jira Release Notes” and “Project Release Notes” in the pre-deployment preparations.

The former is for business people, while the latter is for engineering people, so that the entire company can keep track of what new features they’re interested in or whether Bugfix is online.

Problems and Deficiencies

The CI/CD process above has many shortcomings; the most important are the following.

1. Releases are built from source code

Every GrowingIO SaaS release, whether to the Dev, QA, Staging, or Prod environment, is currently recompiled from the corresponding code branch and then deployed. The problems with building from source for every release are:

It wastes time: the code is recompiled for every release, and some services take up to 10 minutes to compile.
The build environment may differ between compilations, for example when the version of a remote dependency changes, so the compiled output may differ. A build that passed testing can then fail after going online.

This problem is relatively easy to solve. For example, the CI process for the GrowingIO private deployment product line already delivers binary packages, which enables compiling once and deploying to multiple environments; the basic premise is the separation of code and configuration. The main reason SaaS has not been reworked yet is that it is waiting for the production environment to adopt Kubernetes-based container cloud deployment.
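
The idea of compiling once and deploying to multiple environments can be illustrated with a small model in which the artifact is identified by its checksum and only the configuration varies per environment. The names and configuration keys below are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Artifact:
    name: str
    version: str
    sha256: str  # identifies the exact bytes that were tested


def build_once(name: str, version: str, package_bytes: bytes) -> Artifact:
    """Record the checksum of a package produced by a single build."""
    return Artifact(name, version, hashlib.sha256(package_bytes).hexdigest())


def deployment_plan(artifact: Artifact, env: str, config: Dict[str, str]) -> Dict[str, object]:
    """Pair the same artifact with environment-specific configuration."""
    return {"env": env, "artifact": artifact.sha256, "config": dict(config)}
```

Because QA and Prod plans reference the same checksum, what goes online is byte-for-byte what was tested; only the configuration differs.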

2. Insufficient coverage of automated tests

An efficient CI/CD process cannot be achieved without the support of automated testing. Currently, the coverage of unit testing, API automated testing and UI automated testing of GrowingIO’s entire SaaS product is not perfect enough, resulting in high reliance on manual inspection and low efficiency of the whole process. This relies on the company’s continued investment in quality assurance to gradually improve, and there are no shortcuts.

3. The CI/CD process is split into several stages that are connected manually through email, DingTalk and Jenkins

At present, the CI/CD process is divided into the Feature CI process, the Diff CI process and the deployment process. The three processes are connected by email notifications and DingTalk messages, and the different Jenkins jobs are chained together by hand. As the company grows, this kind of collaboration becomes more chaotic and inefficient, and it is difficult to collect and measure data.

The effective way to solve this problem is to develop a tool platform to integrate process and tools, provide unified management entrance, process specification, and realize automatic data collection and analysis.

4. The deployment process is too heavyweight and does not support lighter rolling releases or canary deployments

As mentioned above, blue-green rolling deployment is used to keep user service uninterrupted during a release. In this mode, all the microservices in one of the two clusters are offline, so the maximum online service capacity is halved. This is extremely risky at peak usage times and limits the release time window.

Furthermore, taking half of the capacity offline every time a single microservice is upgraded is clearly not reasonable. The current plan is to use Kubernetes-based container cloud deployment in the production environment to address this issue.
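
A quick calculation shows why a lighter rolling release is attractive. The sketch assumes the two clusters are the same size and that capacity is spread evenly across instances; the numbers in the usage note are illustrative only.

```python
def capacity_during_blue_green(total_capacity: float) -> float:
    """With one of two equal clusters offline, half of the capacity remains."""
    return total_capacity / 2


def capacity_during_rolling(instances: int, batch_size: int,
                            total_capacity: float) -> float:
    """Capacity left while `batch_size` of `instances` are being replaced."""
    if not 0 < batch_size <= instances:
        raise ValueError("batch_size must be between 1 and instances")
    return total_capacity * (instances - batch_size) / instances
```

For a service sized for 1000 requests per second on 10 instances, the two-cluster approach leaves 500 during the release, while replacing one instance at a time leaves 900.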

Conclusion

This article mainly introduces the concept of CI/CD, the various tools adopted internally by the GrowingIO SaaS product, and the specific practices on CI/CD. As mentioned above, we are still a long way from a mature CI/CD. There are even areas where best practices are not followed, such as source-based distribution.

These practices were, however, built up and refined step by step based on the company’s actual situation, and I hope they offer some inspiration. This article mainly describes the overall CI/CD process at a high level. Subsequent articles will describe the configuration and use of specific tools, as well as GrowingIO’s CI/CD improvements for its private deployment products.

About GrowingIO

GrowingIO is a leading one-stop digital growth solution provider in China. It provides product, operations, marketing and data teams and their managers with a customer data platform (CDP), advertising analysis, product analysis, intelligent operations and other products and consulting services, helping enterprises improve their data-driven capabilities and achieve better growth on the road of digital transformation.
