Abstract:

Takeaway: At the 2017 Beijing Cloud Conference, Alibaba senior technical expert Chen Xin (alias Shenshow) gave the talk “Enterprise-level efficient continuous delivery behind the 168.2 billion”. Combining Alibaba’s DevOps practice, Shenshow analyzed in depth how to do continuous delivery at enterprise level, how to collaborate efficiently, and how to control cost, starting from the problems technical leaders care about: a chaotic R&D process, quality assurance, inefficient environment management, and wasted resources.


First, the troubles of technical managers




The daily routine of a development engineer







Let’s take a look at how development engineers work day to day. It always comes down to the same three things: write code, test it, and release it. To be specific, the first step is to pull a branch. Each team generally has its own R&D specifications, which all team members need to follow to keep basic order in development. Second, we run some tests locally and write some test cases. After the tests pass, we open a merge request, and finally the code goes through a process that releases it into production step by step.




These three things sound simple, but they matter most to our R&D efficiency because they are repeated every day, and we keep falling into the same pitfalls: duplicated work, quality problems, environment problems, and collaboration problems. One of the principles of continuous delivery is that if something is painful, do it early and do it as often as possible. We cannot tolerate all these pitfalls and inefficiencies, so we have to find a way to solve the problems that give us headaches.




The troubles of technical managers




The daily troubles of developers are also the troubles of technical managers. To sum up, there are generally several aspects:




  • The R&D process is chaotic. When there are many new people on the team, how can you make sure that every code commit is error-free? When working across teams, how can everyone agree on the process? How can you solve this without an SCM team? In our experience, even an SCM team cannot solve this problem, and it is hard to be foolproof without good tool support.
  • Quality cannot be guaranteed. Can test case coverage, coding-convention scanning, and security checks be enforced? And how do we drive a gradual improvement in code quality instead of letting a broken-window effect take hold?
  • Environment management is inefficient and resources are wasted. This is always a pain for operations engineers, especially in test environments. With the move to microservices the problem becomes even more serious: services depend on each other, and management complexity and workload multiply. Fortunately, containerization is a lifesaver, but it still does not solve the application stability problem. Unstable environments cause development work to block one another, which ultimately slows down the whole team.
  • Open source tools are a patchwork. Can a process stitched together from a pile of tools solve every problem in one place and still offer a good experience? That remains an open question.


At this point you are probably thinking of newer terms such as cloud computing, containerization, and DevOps. Let’s take a look at where these stand in the industry today.




Continuous delivery and cloud computing









This is some information I extracted from the 2016 White Paper on Chinese Software Developers. The first thing we see is the trend of enterprises moving to the cloud, because the cloud, whether public or private, can significantly reduce IT costs. At present, private cloud is still the first choice, having grown from 7% to 27% over the last 3 years. Hybrid cloud is also a popular approach.




DevOps has become a buzzword in the industry. The most direct reading of DevOps is merging the two roles into one, which is certainly more efficient. We can see that 86% of enterprises use DevOps-related tools to some extent, mostly Docker and Jenkins. That is understandable, since these two tools are by far the easiest to adopt.




Finally, there is the recognition of development tools and processes. Most companies have a strict R&D process to regulate development work, and 70% of developers benefit from it, and 21% of developers expect their companies to invest more in this area. That’s an interesting number, and probably suggests that 30% of employees are still struggling and waiting for help.




Continuous delivery and DevOps







Given the pain points of our managers, how can continuous delivery and DevOps be used to solve our software production problems? Let’s start by looking at the core of the problem they’re trying to solve.




  • Continuous delivery: the core is letting requirements flow in small batches through an automated pipeline, so that software can be delivered frequently and in short cycles.
  • DevOps: a few keywords capture its core. It is a method and a culture, built on automation, measurement and sharing, and infrastructure as code.


Explaining these terms is not my focus. If we look past the concepts to the essence, both are really about two words: efficiency and cost. How to achieve the highest efficiency at the lowest cost is the core problem our R&D platform has to solve.




Next, I will focus on these two themes to introduce how Alibaba achieves efficient collaboration and cost control.




Second, efficient collaboration & cost control




Fully automated R&D mode







Speaking of automation, we think of pipelines, but pipelines alone are not enough. They only automate the release process and cannot fully solve developers’ day-to-day collaboration problems, such as when to pull branches, when to merge code, what standards must be met, how to handle the release branch, and how to integrate automatically with the pipeline. These are the most tedious and error-prone parts of the development routine.




How do we standardize the R&D mode into a few sets of specifications, backed by a platform, for the whole group? How do we automate all development collaboration? This is the first problem Cloud Effect has to solve.


In our practice, branch mode, free mode, and Gitflow mode are abstracted to the product level and woven in before, during, and after the pipeline, which gives us a fully automated R&D mode. The whole R&D process takes place on the platform and all data can be traced, which eliminates omissions, mistakes, and chaotic management. Developers only need to checkout and push, and everything else is left to the platform.


To sum up: standardized operation, efficient collaboration, and fewer errors. Branch mode and free mode are already available in Cloud Effect, and other modes and advanced configuration features are coming.
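
As a rough illustration of what “developers only checkout and push” could mean in a branch-based mode, here is a minimal sketch. The article does not describe the platform's internals, so the `Repo` class, the `integrate` and `publish` steps, and the branch names are all hypothetical: feature branches are merged into a fresh release branch by the platform, and the release branch is folded back into master after a successful release.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy repository: each branch is just the set of change IDs it contains."""
    branches: dict = field(default_factory=lambda: {"master": set()})

    def checkout(self, name: str, base: str = "master") -> None:
        # A new branch starts as a copy of its base branch.
        self.branches[name] = set(self.branches[base])

    def push(self, branch: str, change: str) -> None:
        self.branches[branch].add(change)

    def merge(self, source: str, target: str) -> None:
        self.branches[target] |= self.branches[source]


def integrate(repo: Repo, features: list, release: str = "release") -> set:
    """Hypothetical platform step: rebuild the release branch from master
    and merge every selected feature branch into it."""
    repo.checkout(release, base="master")
    for feature in features:
        repo.merge(feature, release)
    return repo.branches[release]


def publish(repo: Repo, release: str = "release") -> None:
    """Hypothetical platform step: after a successful release, merge the
    release branch back into master so the next cycle starts from it."""
    repo.merge(release, "master")


# The developer only does checkout and push; the rest is the platform's job.
repo = Repo()
repo.checkout("feature/login")
repo.push("feature/login", "c1")
repo.checkout("feature/cart")
repo.push("feature/cart", "c2")

integrated = integrate(repo, ["feature/login", "feature/cart"])
publish(repo)
```

Because the release branch is rebuilt from master on every integration in this sketch, dropping a feature from a release only requires leaving it out of the list passed to `integrate`.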


Unifying the technology stack and the operations stack







Once the R&D process is unified, the next step is the technology stack and the operations stack. Most enterprises do not want their technology stack to be too complex, because that imposes a high learning cost on employees. Alibaba, for example, mainly focuses on Java, so we can put a lot of energy into frameworks, JVM optimization, and middleware, and many test schemes are built on this stack as well. Needless to say, standardization brings cost savings and efficiency gains.




Alibaba ensures the unification of the technology stack and the operations stack at the following five layers.
  • First, there is a view layer based on the different development modes and pipelines, and the pipeline view differs slightly depending on the technology stack. For example, releasing a C application is a little more complicated than releasing a Java application, and its package updates and parameter maintenance differ slightly, which we handle with different process components.
  • Second, there are code recommendations covering various language stacks, the latest practices, and frameworks, which makes our platform a window for promoting new technologies.
  • Third, operations templates contain package templates, Dockerfiles, environment plans, and so on. We provide standardized recommendations, and each BU’s technical lead can also customize and maintain templates on the platform for their own department. All of this is fixed when the application is created, and ordinary developers largely do not need to be involved in it or understand it.
  • The last two layers belong to the infrastructure level: a unified application operations platform, responsible for allocating machine resources and related software resources, and system software, which solves virtualization and runtime environment problems.
Only when these five layers work together can the technology stack and the operations stack be fully unified. At Alibaba, Cloud Effect is simply the unified outlet through which the PaaS and IaaS layers reach ordinary R&D personnel, and it is responsible for gluing these services together, such as the latest release technology, middleware isolation, gray-release traffic control, and testing technology.




All-cloud testing platform




So far we have covered, from the perspectives of R&D, technology, and operations, how to help developers collaborate efficiently. Here are two examples of cost control.




The first is the all-cloud testing platform. Although Alibaba’s technology stack is generally unified, each BU still innovates in testing on its own. The resulting variety of tools and platforms inevitably leads to duplicated construction and wasted resources. In recent years we have been building a cloud-based testing platform. First, all kinds of testing tools are connected to the platform through unified standards, and then a unified scheduling engine guarantees resources for testing. Once we control the resources, we can apply many resource-saving strategies, such as dynamic scaling, resource pool reuse, and co-locating offline and online workloads. At present, we run more than 100,000 tasks a day.









As this figure shows, users’ various test tools all run on our test engine, with resources drawn from the group’s ECS and Docker pools. In addition, Cloud Effect lets enterprises plug in their own self-developed automated test tools, which makes it easy for users to promote and implement their own test schemes within the enterprise.
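
To make “resource pool reuse” and “dynamic scaling” a little more concrete, here is a minimal sketch of a shared executor pool. It is an illustration only: the `ExecutorPool` class, its method names, and the sizing policy are assumptions, not the scheduling engine described in the talk.

```python
class ExecutorPool:
    """Hypothetical pool of test executors shared by all testing tools."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.idle = []      # executors waiting to be reused
        self.busy = set()   # executors currently running a task
        self.started = 0    # executors ever started, also used for naming

    def acquire(self) -> str:
        if self.idle:
            executor = self.idle.pop()            # reuse before scaling up
        elif len(self.busy) < self.max_size:
            self.started += 1                     # scale up on demand
            executor = f"executor-{self.started}"
        else:
            raise RuntimeError("pool exhausted; the task has to wait")
        self.busy.add(executor)
        return executor

    def release(self, executor: str) -> None:
        self.busy.remove(executor)
        self.idle.append(executor)

    def shrink(self) -> int:
        """Scale down by dropping every idle executor; returns how many."""
        dropped = len(self.idle)
        self.idle.clear()
        return dropped


pool = ExecutorPool(max_size=2)
first = pool.acquire()
pool.release(first)
second = pool.acquire()   # the released executor is reused, none is started
```

Reusing an idle executor before starting a new one is what keeps the number of machines close to the real concurrency of test tasks rather than to the number of teams or tools.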




Test environment isolation


Let’s take a look at the test environment. Don’t underestimate the resource cost of offline testing. At present, the number of physical machines used for offline testing at Alibaba has reached the tens of thousands. As microservices advance, the number of applications and the complexity of their dependencies grow rapidly, and managing the test environment and its resources has become very difficult.




Here is a good practice from Alibaba: test environment isolation. Let’s take a look at this diagram. We divide the daily working environment into three parts. The first is the development machine; the initiator of a call request may well be my own laptop. The second is the isolated environment, where we deploy the applications A1, B1, and C1 that need joint debugging. These three applications together implement one requirement that is about to go live. Of course, the real call chain is longer and has more dependencies, so the third part is a base environment in which all our applications are deployed.









This isolation design not only keeps joint debugging working but also saves a lot of machine resources. The most important question, of course, is how to isolate the services from each other. Inside Alibaba, unified HTTP access, full-link tracing, and middleware technology help us achieve non-invasive service isolation with no additional operations work. You only need to mark the corresponding service scope in the system, and incoming requests are automatically tagged (“dyed”). No matter where a request goes, and whether the call is synchronous or asynchronous, the A, B, and C services in the two environments are guaranteed to stay mutually exclusive.
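
The routing idea behind request dyeing can be sketched in a few lines. This is a simplified model under stated assumptions, not Alibaba's middleware: the `DEPLOYMENTS` table, the environment name `feature-42`, and the rule “use the isolated instance if one exists, otherwise fall back to the base environment” are hypothetical choices made for illustration.

```python
BASE_ENV = "base"

# Hypothetical deployment table: environment name -> services deployed there.
DEPLOYMENTS = {
    BASE_ENV: {"A", "B", "C", "D"},
    "feature-42": {"A", "B", "C"},   # the isolated copies (A1, B1, C1 in the text)
}


def route(service: str, env_tag: str) -> str:
    """Pick the instance that serves `service` for a request tagged `env_tag`."""
    if service in DEPLOYMENTS.get(env_tag, set()):
        return f"{service}@{env_tag}"
    # Assumed fallback: services outside the isolated scope use the base copy.
    return f"{service}@{BASE_ENV}"


def call_chain(services: list, env_tag: str) -> list:
    """Follow a call chain; the tag travels with the request on every hop."""
    return [route(service, env_tag) for service in services]


tagged = call_chain(["A", "B", "D"], "feature-42")
untagged = call_chain(["A", "B", "D"], BASE_ENV)
```

In this model a tagged request reaches the isolated copies of A and B but the shared copy of D, while an untagged request never touches the isolated copies, which is the mutual exclusion the paragraph above describes.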




Similar environment planning and service isolation capabilities will be available in Cloud Effect in the future. To sum up: environment reuse, one-click provisioning, fast bring-up and teardown, and no code intrusion.




Enterprise-level continuous delivery







Next, let’s look at how a continuous delivery platform can be truly enterprise-class.




The first stage is application creation: metadata maintenance, code recommendation, technology stack templates, and pipelines. The second is test acceptance: static scanning, coding conventions, security testing, and so on. The third is deployment once testing is complete, which involves environment standardization; unified enterprise-wide environment planning, operations templates, and the dynamic scalability of the cloud can save us a great deal of cost. The fourth, before the final release, is release control: approval management, acceptance checkpoints, release windows, and other control policies. The fifth is the go-live process itself, which requires batch rolling, process monitoring, and fast rollback.


These five stages basically cover our developers’ daily R&D activities, and work flows through them continuously to achieve the ultimate goal of continuous delivery.
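
The last stage, batch rolling with fast rollback, lends itself to a short sketch. The function below is a toy model under assumptions: hosts are a plain dictionary, `healthy` is a caller-supplied check, and a single failed batch rolls every host back to the recorded baseline. Real platforms usually offer finer-grained policies.

```python
def rolling_release(hosts, new_version, batch_size, healthy):
    """Deploy `new_version` batch by batch; roll everything back on failure.

    `hosts` maps host name -> current version. `healthy` is a hypothetical
    caller-supplied check run on each host after it has been upgraded.
    Returns True if every batch succeeded, False if a rollback happened.
    """
    previous = dict(hosts)                        # baseline for fast rollback
    names = sorted(hosts)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for name in batch:
            hosts[name] = new_version
        if not all(healthy(name, new_version) for name in batch):
            hosts.update(previous)                # restore every host
            return False
    return True


fleet = {"h1": "v1", "h2": "v1", "h3": "v1"}
ok = rolling_release(fleet, "v2", batch_size=2,
                     healthy=lambda name, version: True)
```

Checking health after each batch limits the blast radius of a bad release to one batch, and keeping the baseline around is what makes the rollback fast.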




Third, DevOps in practice at Alibaba


App-centric DevOps


Let’s go back to the term “DevOps”. I would like to share how Alibaba has put DevOps into practice, and what the DevOps movement looks like from a developer’s perspective.









The first thing I want to mention is the idea of app-centric DevOps. An application can be reduced to one type of record in the CMDB. It is naturally friendly to developers because it maps directly to a service and a code base. Starting from the code, we can connect the pipeline, environments, testing, and resources, and then the outermost tool chain: monitoring, databases, operations, middleware, and so on.




Connecting the tool chain through the application makes it easier for developers to understand and work through the whole DevOps process. We no longer have developers talking about code and services while operations engineers talk about machines and server rooms, with each side talking past the other.




Once the tools are connected through the application, developers can naturally define their application on the platform and define its operations too. For example, I can plan the environment, create resources, set release strategies, and so on.




Once the definition is done, whoever defined it is responsible for it. At Alibaba, developers are therefore responsible for the whole life cycle of the application, which is the full circle we see here. By pushing this idea forward together with operations automation tools, Dev has gradually taken over the work of Ops and discovered that it is not that complicated.




Application life cycle self-management




Let’s take a look at a practical example. This is our application onboarding process: information initialization, code recommendation, configuration settings, and resource application, step by step.









Whether it is the test environment or the production environment, the technology stack or the operations stack, everything is defined by the development engineer. The whole process needs only one approval, and an application can be registered and released within minutes.




When it comes to life-cycle management, we also work with the measurement system to assess the health and resource usage of applications after launch, and to clean up and adjust the resources of long-tail applications in a timely manner.




Containerization helps DevOps land


Now for containerization. Alibaba started to promote container technology on a large scale before November 11, 2016. By November 11 this year, almost all active applications had been containerized, which is a remarkable technological change. Why did we push so hard for it, and how does containerization help DevOps?









If you look at this picture, on the left is what we had to do, and the roles we needed, before containerization: package management, baseline changes, operations script changes, resource applications, capacity expansion, and so on. None of this work could be done without operations engineers. These tasks were hard, manual, and complicated to change, which made collaboration between the two roles inefficient.




On the right is the situation after containerization. Package management, baseline changes, and operations scripts are now all handled by the Dockerfile. Changes are managed in the code base and rolled out through the pipeline, which greatly reduces the complexity of change. The rest, such as log-cleaning and inspection scripts, is hosted on the operations tools through network-wide command-channel plug-ins, and the engineers who used to own these scripts have moved into tool development to keep the plug-ins reliable.




Other resource problems are solved by container scheduling, which includes developing unified scheduling and the algorithms for large-scale operations. The general problem of resource utilization is tackled with artificial intelligence, for example the co-location of online and offline workloads launched on November 11, 2017.




Through containerization we standardized the environment, pushed operations services down into the platform, and addressed efficiency and cost intelligently. So we can say that containerization is what really put the concept of DevOps into practice at Alibaba.




Loose control and strong checkpoints







Finally, there is one critical point. When developers begin to define operations and take over operations, we managers may have some concerns, such as whether arbitrary operations by developers will cause online failures, or whether unrestricted releases will cause stability problems.




At Alibaba, the core concept behind our platform is loose control and strong checkpoints. First, where is it loose? We offer developers a variety of pipeline options. The application owner can fully define the application’s rules, such as how to release, how to test, and how to configure resources and environments. We have general builds and custom builds to give users maximum freedom. Finally, we go light on release and heavy on recovery: for each application, developers can use the pipeline to deliver code at any time with no particular constraints, and they only need to think about how to recover quickly if something goes wrong.




Alongside this freedom, there are a number of checkpoints to choose from, such as code review, quality red lines, coding-convention checks, release windows, and front-end and back-end integration. The purpose of these checkpoints is to ensure that all development engineers in the group stay on the same track and deliver qualified products.
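
A checkpoint can be thought of as a small predicate evaluated before a release is allowed to proceed. The sketch below is purely illustrative: the field names `reviewed` and `coverage`, the 60% coverage threshold, and the 10:00 to 18:00 release window are assumptions, not Alibaba's actual rules.

```python
from datetime import time


def check_gates(change, now, window=(time(10, 0), time(18, 0)),
                min_coverage=0.6):
    """Return the names of the checkpoints that `change` fails.

    `change` is a hypothetical dict with a `reviewed` flag and a test
    `coverage` ratio; `now` is the time of day of the release attempt.
    """
    failures = []
    if not change["reviewed"]:
        failures.append("code review")
    if change["coverage"] < min_coverage:
        failures.append("quality red line")
    if not (window[0] <= now <= window[1]):
        failures.append("release window")
    return failures


passed = check_gates({"reviewed": True, "coverage": 0.8}, time(11, 0))
blocked = check_gates({"reviewed": False, "coverage": 0.5}, time(23, 0))
```

An empty list means the release may proceed. Returning every failed checkpoint, rather than stopping at the first one, lets a developer fix all problems in a single pass.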


The core of continuous delivery is delivering value quickly: give developers maximum freedom and let them be responsible for the whole process of development and operations. With monitoring, fault prevention and control tools, and feature switches, a balance can be struck between protecting the user experience and delivering value quickly.


Fourth, Cloud Effect in practice


The best path to getting started with Cloud Effect







Finally, let’s look at how Cloud Effect can improve R&D efficiency. The quick start on the home page walks through creating the product space, creating the code base, choosing a development mode, and creating the application, which is the very important basic metadata we talked about earlier. Then you configure the build, machines, deployment rules, and pipeline delivery.




Cloud Effect provides templates and test machines to help us get started quickly.




Use a continuous delivery pipeline









What you are looking at here is the pipeline view of our system, which differs slightly depending on the development mode. Cloud Effect independently developed a new pipeline based on the characteristics of modern enterprise software development. Different teams, such as R&D, testing, and operations, can work on different stages without interfering with each other. There are a variety of trigger mechanisms to choose from, and more of Alibaba’s best components are still being opened up.




Non-intrusive build acceleration









Let’s look at builds. The cloud-based build scheduling system developed by Cloud Effect (Yunxiao) has a security hardening mechanism approved by the Alibaba Cloud security team. In addition, it provides an adaptive build cache strategy for each technology stack to avoid downloading dependencies repeatedly, which greatly shortens build time and makes the development process more efficient. Developers using Cloud Effect only need to choose their stack and build commands, and the platform automates the rest.
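
The article does not say how the build cache is implemented. One common way to avoid repeated dependency downloads, shown here only as an assumed illustration, is to key the cache on the technology stack plus a hash of the dependency manifest, so the download runs again only when the manifest changes. The `DependencyCache` class and the `download` callback are hypothetical.

```python
import hashlib


class DependencyCache:
    """Hypothetical cache of resolved dependencies, keyed by stack + manifest."""

    def __init__(self) -> None:
        self._store = {}
        self.downloads = 0   # how many times a real download was needed

    @staticmethod
    def key(stack: str, manifest: str) -> str:
        digest = hashlib.sha256(manifest.encode("utf-8")).hexdigest()
        return f"{stack}:{digest}"

    def resolve(self, stack: str, manifest: str, download):
        """Return cached dependencies, calling `download` only on a miss."""
        cache_key = self.key(stack, manifest)
        if cache_key not in self._store:
            self._store[cache_key] = download(manifest)
            self.downloads += 1
        return self._store[cache_key]


def fake_download(manifest: str) -> list:
    # Stand-in for fetching packages from a remote repository.
    return sorted(manifest.split())


cache = DependencyCache()
first_build = cache.resolve("java", "guava junit", fake_download)
second_build = cache.resolve("java", "guava junit", fake_download)
```

The second build hits the cache because the manifest is unchanged. Adding or removing a dependency changes the hash and triggers exactly one new download.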




Full network deployment capability


As for deployment, Cloud Effect can connect and deploy to hosts on public clouds, private clouds, and other clouds. It is based on Alibaba’s internal Agent technology, which is safe and efficient and has been deployed across Alibaba’s entire network and improved over the years.




As shown in the figure below, whether you purchase the public cloud or the private cloud edition of Cloud Effect, hosts can be connected directly without being exposed to the public network. Different clouds can be interconnected over the Internet or a dedicated line, which shields developers from the details of the underlying machine resources and improves automation and work efficiency.









Switching freely among Docker, EDAS, and ECS









Cloud Effect currently supports three deployment modes: Docker, EDAS, and ECS. The deployment mode can be defined separately for each environment of each application and switched freely. For example, the production environment can use EDAS to ensure stability, while the test environment uses ECS for mixed deployment to save resources.




When we transform and upgrade the operations stack, we can upgrade smoothly by modifying the deployment configuration, and if there is a problem, we can roll back with one click. Cloud Effect stores the historical baseline data of all software releases and upgrades, which can be inspected and rolled back to at any time. This is practical experience accumulated at Alibaba over many years.
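
Per-environment deployment modes with a rollback history can be modelled very simply. The sketch below is an assumption-laden illustration, not Cloud Effect's data model: the `DeployConfig` class and its methods are hypothetical, and only the three mode names come from the text above.

```python
class DeployConfig:
    """Hypothetical per-environment deployment settings with history."""

    MODES = {"docker", "edas", "ecs"}

    def __init__(self) -> None:
        self.current = {}   # environment -> active deployment mode
        self.history = {}   # environment -> earlier modes, oldest first

    def set_mode(self, env: str, mode: str) -> None:
        if mode not in self.MODES:
            raise ValueError(f"unknown deployment mode: {mode}")
        if env in self.current:
            # Keep the old value as a baseline before overwriting it.
            self.history.setdefault(env, []).append(self.current[env])
        self.current[env] = mode

    def rollback(self, env: str) -> str:
        """Restore the most recent earlier mode for `env` and return it."""
        if not self.history.get(env):
            raise LookupError(f"no earlier baseline for {env}")
        self.current[env] = self.history[env].pop()
        return self.current[env]


config = DeployConfig()
config.set_mode("production", "edas")
config.set_mode("test", "ecs")
config.set_mode("test", "docker")    # try Docker in the test environment
restored = config.rollback("test")   # one step back to the previous baseline
```

Because every change records the value it replaces, switching modes in one environment never affects another, and a rollback is just a pop from that environment's history.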




About the author


Chen Xin is a senior technical expert at Alibaba, responsible for building Alibaba Group’s continuous delivery platform and R&D tools and devoted to research on enterprise R&D efficiency, product quality, and DevOps. Over 6 years at Alibaba, Chen has led the big data test team, the test tool R&D team, and the continuous delivery platform team, and has deep insight into R&D collaboration, testing, delivery, and operations.