Author: Zou Sheng, senior DevOps engineer on the Qunar infrastructure team, mainly responsible for developing and maintaining the CI/CD platform and for researching and implementing cloud native technology; he is also a KubeSphere Talented Speaker. Jingxian Chen is a DevOps product manager at Qunar, mainly responsible for spreading DevOps culture at Qunar and for investigating, introducing, and developing best practices in processes, tools, and platforms that help the company deliver software faster, reduce risk, and cut operating costs.

In recent years, as cloud native technology has matured, Qunar took its first step toward cloud native in 2021: containerization, as the start of evolving its entire technology stack. The rollout covered value assessment, infrastructure construction, CI/CD process transformation, middleware adaptation, observability tooling, automated application migration, and more. The migration involved about 3,000 applications and an R&D organization of over a thousand engineers, making it a very demanding piece of systems engineering. This article covers best practices from the migration, including the CI/CD model transformation and automated application containerization.

Background

Business pain points before containerization

Before containerization, we often heard complaints and jokes from different roles:

  • Leadership: server resources and maintenance costs are too high; we have to cut costs!
  • Ops: a host is down and needs to be restored as soon as possible!
  • QA: the test environment was fine, so why did the release fail?
  • R&D: traffic is peaking and we don't have enough resources; scale up! Argh, scaling is too slow.

These problems stem from low resource utilization, the lack of service self-healing, inconsistency between the test and production environments, and an inflexible runtime environment.

The business environment that enterprises face is complex and fast-changing. Competition today is extremely fierce, and it is no longer the big fish eating the small fish, but the fast fish eating the slow fish. Jack Welch, the former CEO of General Electric, once said, "If the rate of change on the outside exceeds the rate of change on the inside, the end is near." How can an enterprise survive and grow in such an environment? Every company moves forward through constant exploration and practice. In recent years, cloud native technology has matured and been widely accepted and adopted. It can help enterprises reduce costs, speed up business iteration, and provide strong technical support for industrial upgrading. As one of the cloud native technologies, containerization became an important part of Qunar's technology upgrade toward cloud native.

Challenges and solutions in the containerization process

Rolling out containerization across the whole company is not easy. We faced numerous difficulties and worked through them one by one to reach the final goal.

First, many departments were involved: the containerization migration touched infrastructure teams such as Ops, infrastructure, and data, as well as 20+ business departments. Getting that many departments to agree on goals and stay coordinated is hard. Fortunately, the project was endorsed by the company's senior management and set as an enterprise-level goal for 2021; under that goal, all departments worked together to ensure the project's success.

Second, the scope of transformation was large: for historical reasons, most of our services were stateful. Middleware, the release system, networking, storage, log collection, monitoring and alerting, and the services themselves depended heavily on machine names, IP addresses, and local storage. Containers are ephemeral and stateless, which meant we had to rework the infrastructure as well as the tooling for deployment, operations, and platforms. We enumerated these problems one by one and resolved them until the conditions for containerized migration were met.

Third, the cost of migrating the business was high: roughly 3,000 applications were involved, and migrating each one requires upgrading, testing, regression, and release, which would consume a lot of manpower. How could we reduce that cost? We pushed most of the adaptation work down into the middleware layer for unified support, automated the middleware upgrades in business code, and automated the container migration itself. A continuous delivery pipeline automatically tests and verifies the changes made during migration, and canary mixing of containers and virtual machines provides a safe observation window. Together, these measures greatly reduced the manual cost of business migration.

Finally, the learning cost was high: our R&D organization numbers in the thousands, and introducing or upgrading technology requires engineers to spend extra effort learning and using it. To reduce this cost, we hid the differing operations and technical details behind platform tools, and lowered the learning curve through visual configuration, guided operations, and an optimized continuous delivery process.

Benefits after containerization

In 2021, we completed the containerization infrastructure, upgraded the tool platforms, and containerized 90% of applications (out of roughly 3,000 in total). In terms of results, the deployment density rose from 1:17 to 1:30 and resource utilization improved by 76%. Host operations that used to take days now take minutes, a 400x improvement in operational efficiency. Application delivery speed increased by 40% thanks to faster container startup and an optimized deployment strategy. The K8s cluster provides service self-healing, with applications self-healing about 2,000 times per month on average. Containerization also laid the foundation for further adoption of cloud native technology.

Continuous delivery

Project development process

Qunar adopts a dual-stream model: a business-driven value stream paired with an application-centric continuous delivery workflow. The enterprise's goal is delivering business value. During delivery, the artifacts of each stage pass between multiple roles, and some degree of waste inevitably occurs in each handoff. To measure and optimize delivery efficiency effectively, we need to look not only at efficiency gains during development but also before development. We connect the continuous delivery workflow to the value stream, aiming for automatic flow from the project domain through the development and test domains to the operations domain. Before containerization, however, inconsistent environments and inconsistent per-stage configuration meant manual intervention was unavoidable as work moved between stages. After containerization, we drew on the cloud native OAM model to standardize the definition of an application, build application portraits, unify terminology, and eliminate data silos, so that the process flows smoothly and quickly.

Application portrait

In the flow above, the most important connector is the application identifier (appcode), so we abstracted it as well, into a portrait definition:

Developers have to write and maintain deployment configuration for many applications across development, test, and production environments, while O&M staff have to understand and integrate with different platforms and manage different operational capabilities and processes.

Following cloud native application-model principles, we standardized the definition of an application by creating application portraits. Developers and operators collaborate through this standard application description, enabling one-click deployment and modular operations without per-service onboarding, configuration, or integration work. This improves both the efficiency and the experience of application delivery and operations.
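As a concrete illustration, an application portrait can be thought of as a single structured description that both developers and operators work against. The following Python sketch shows one possible shape; all field names are illustrative assumptions, not Qunar's actual schema.

```python
from dataclasses import dataclass, field

# A minimal sketch of a standardized application portrait. Every field
# name here is an assumption for illustration, not Qunar's real schema.
@dataclass
class AppPortrait:
    app_code: str                       # unique application identifier
    owners: list                        # responsible developers/operators
    language: str = "java"              # runtime of the service
    replicas: int = 2                   # desired instance count
    resources: dict = field(default_factory=lambda: {"cpu": "1", "memory": "2Gi"})
    dependencies: list = field(default_factory=list)  # middleware the app uses
    probes: dict = field(default_factory=dict)        # health-check endpoints

    def to_deploy_spec(self) -> dict:
        """Render the portrait into a generic deployment description."""
        return {
            "name": self.app_code,
            "replicas": self.replicas,
            "resources": self.resources,
            "healthChecks": self.probes,
        }

portrait = AppPortrait(app_code="order-service", owners=["alice"],
                       probes={"liveness": "/healthz"})
spec = portrait.to_deploy_spec()
```

Because the portrait is the single source of truth, platform tools can render it into deployment, monitoring, or alerting configuration without the developer touching each system separately.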

With applications standardized, the application platform itself became unified and standardized, and we shifted from being resource-centric to application-centric. The platform architecture is as follows:

Multi-cloud synergy

Infrastructure resource layer: the underlying resources support both KVM and containers and are deployed across data centers. To make the resource pool more elastic, we also connected to public clouds.

Platform layer: built on K8s, with KubeSphere multi-cluster management, Service Mesh, Serverless, and other cloud native technologies to keep the technology stack and architecture evolving.

Resource scheduling:

  • Node affinity: machine configuration is considered, such as HDDs vs. SSDs and gigabit vs. 10-gigabit NICs
  • Capacity estimation: capacity is estimated before a release; if resources are insufficient, the release is blocked
  • Machine load: load is balanced across all nodes in the cluster as evenly as possible
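The three scheduling considerations above can be sketched as a simple placement function: filter nodes by hardware affinity, refuse placement when capacity is short, then prefer the least-loaded node. This is an illustrative sketch under assumed field names, not the real scheduler.

```python
# Illustrative sketch of the scheduling considerations listed above.
# Node and pod field names are assumptions for the example.
def pick_node(nodes, pod, want_ssd=False):
    candidates = []
    for node in nodes:
        if want_ssd and not node["ssd"]:
            continue  # affinity: require SSD-backed nodes when asked
        free_cpu = node["cpu_total"] - node["cpu_used"]
        if free_cpu < pod["cpu"]:
            continue  # capacity check: skip nodes without enough headroom
        candidates.append(node)
    if not candidates:
        return None  # mirrors "block the release when capacity is insufficient"
    # load balancing: choose the node with the lowest CPU utilization
    return min(candidates, key=lambda n: n["cpu_used"] / n["cpu_total"])

nodes = [
    {"name": "n1", "ssd": True, "cpu_total": 32, "cpu_used": 24},
    {"name": "n2", "ssd": True, "cpu_total": 32, "cpu_used": 8},
]
best = pick_node(nodes, {"cpu": 4}, want_ssd=True)
```

In a real cluster these rules map onto Kubernetes node affinity, resource requests, and scheduler scoring plugins rather than a hand-written loop.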

Dual-Deployment release

Advantages:

  1. Reduced operational complexity: operations are limited to create, scale, and delete. A single Deployment's in-place update can fail and get stuck in an intermediate state where neither the upgrade nor a rollback can proceed
  2. Batch operations are supported, making the release process more controllable
  3. Updating Deployment labels becomes possible

Disadvantages:

  • The state of both Deployments must be recorded and controlled, which is more complex than managing a single Deployment
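The batch mechanics behind a dual-Deployment release can be sketched as a pure replica plan: the new Deployment is scaled up while the old one is scaled down, one batch at a time. This is a minimal illustration of the idea, not the actual release system.

```python
# Sketch of the dual-Deployment idea: instead of updating one Deployment
# in place, create a second one and shift replicas over in controlled
# batches. Batch size and replica counts are illustrative.
def rollout_plan(total_replicas, batch_size):
    """Return (old_replicas, new_replicas) after each batch."""
    steps = []
    moved = 0
    while moved < total_replicas:
        moved = min(moved + batch_size, total_replicas)
        steps.append((total_replicas - moved, moved))
    return steps

plan = rollout_plan(total_replicas=6, batch_size=2)
```

Because both Deployments exist at every step, a failed batch leaves the system in a known state, and rollback is simply scaling in the opposite direction.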

Automated migration of business applications

This containerization effort required migrating 3,000+ applications. To reduce the migration cost for developers and testers, we built a solution that automatically upgrades the Java SDK and automatically migrates applications to containers.

Automated migration procedure:

  1. Pre-check

    • Verify that the application references the SDK required for the automatic upgrade
    • At the compile phase, check for Java methods unsuited to containerized scenarios, such as calls that depend on the hostname. If any are found, the containerized release fails early. In a container environment it is normal for IPs and hostnames to change frequently, so business logic that branches on the hostname, as some teams did in the past, becomes problematic
  2. Test environment verification

    Release the automatically upgraded SDK in the test environment and verify that the application's containerized upgrade works correctly

  3. Production environment verification

    • Canary release to production, initially without taking any traffic
    • Notify the application owner to run automated tests
    • If no problems are found, route production traffic to the containers
  4. Mixed deployment

    • Containers and KVMs are deployed side by side in production, with the traffic split following the instance ratio
    • Watch metrics and alerts
  5. Route all traffic to the containers

    The containers are fully deployed and KVM capacity is drained (the KVMs are kept for a few days for observation)

  6. Observation

    Watch service monitoring metrics after moving fully to containers. If anomalies are found, roll back to KVM

  7. KVM reclamation

    To release resources back to the K8s cluster quickly, application owners are notified to reclaim their machines within a set period; KVM resources not reclaimed within 7 days are forcibly reclaimed
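The compile-phase pre-check from step 1 can be sketched as a simple source scan: flag Java calls that bind business logic to the hostname, so the build fails fast before a container release. The patterns below are illustrative, not an exhaustive rule set.

```python
import re

# Sketch of the compile-phase pre-check described in step 1: scan Java
# source for calls that tie behavior to the hostname, which breaks in
# containers where hostnames change on every restart. Pattern list is
# illustrative only.
HOSTNAME_PATTERNS = [
    re.compile(r"InetAddress\.getLocalHost\(\)\.getHostName\(\)"),
    re.compile(r"System\.getenv\(\s*\"HOSTNAME\"\s*\)"),
]

def check_source(java_source: str):
    """Return (line number, offending line) pairs so the build can fail fast."""
    problems = []
    for lineno, line in enumerate(java_source.splitlines(), start=1):
        if any(p.search(line) for p in HOSTNAME_PATTERNS):
            problems.append((lineno, line.strip()))
    return problems

code = 'String host = InetAddress.getLocalHost().getHostName();'
issues = check_source(code)
```

In practice a check like this would run as a build plugin over the whole source tree, with an allowlist for legitimate uses.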

Automatic SDK upgrade process

tcdev-bom is common company infrastructure for managing second-party and third-party package dependencies. Most Java applications use this component, so our SDK upgrade solution automatically detects and upgrades the tcdev version at compile time.
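As an illustration of the compile-time upgrade, the version bump can be sketched as a rewrite of the BOM version declared in a pom.xml. The property name used here is an assumption for the example; the real pipeline operates on Qunar's internal tcdev-bom component.

```python
import re

# Illustrative sketch of the automatic SDK upgrade: rewrite the BOM
# version property in a pom.xml during the build. The property name
# "tcdev-bom.version" is an assumption for this example.
def bump_bom_version(pom_xml: str, new_version: str) -> str:
    return re.sub(
        r"(<tcdev-bom\.version>)[^<]+(</tcdev-bom\.version>)",
        rf"\g<1>{new_version}\g<2>",
        pom_xml,
    )

pom = "<properties><tcdev-bom.version>1.2.3</tcdev-bom.version></properties>"
updated = bump_bom_version(pom, "1.3.0")
```

After the bump, the pipeline would rebuild the application and run the compatibility tests listed in the prerequisites below before any release proceeds.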

Prerequisites for the SDK upgrade

  • Unified management of second- and third-party packages: Super POM and tcdev-bom provide unified management and upgrades of core components
  • Quality gates: static code checks, version compatibility checks, and quality control before release
  • Automatic compatibility verification after the upgrade: run automated tests in batches, comparing the master branch against the upgraded branch
  • Managed release permissions: automatic upgrade, canary, and go-live

Upgrade steps

In retrospectives, many business-line engineers reported that the self-service upgrade and migration saved them a great deal of time; the value was obvious and widely recognized.

Conclusion

In a cloud native transformation, letting the business enjoy the dividends of cloud native smoothly is a real challenge. I hope this article offers some inspiration to readers who are just getting started with cloud native. On the road to cloud native, let's move forward together!
