CloudOps: A New Trend in Application-Centric Automated O&M

On December 21, at the annual Alibaba Cloud Elastic Computing summit, Tian Taotao, head of Alibaba Cloud Elastic Computing Experience and Control System, delivered a speech titled “Efficient and Intelligent Cloud: CloudOps Makes Operation and Maintenance Easier”. He gave an in-depth interpretation of CloudOps, the new trend in cloud operation and maintenance (O&M), and introduced the new products in the Alibaba Cloud CloudOps automated O&M suite in detail.

Tian Taotao, head of Alibaba Cloud Elastic Computing Experience and Control System

This article is based mainly on Tian Taotao’s speech and is divided into three parts:

  1. From Ops in Cloud to CloudOps;
  2. Application-centric automated operation and maintenance;
  3. CloudOps (Automated Operations on the Cloud) white paper released.

01 From Ops in Cloud to CloudOps

1. Pain points of DevOps implementation

It has been 12 years since DevOps was first introduced, and many companies have adopted it with great success. However, enterprises encounter different challenges at different stages of DevOps implementation:

◾ Before the DevOps transition: Many organizations find that they lack DevOps experts. The initial investment in DevOps is heavy and requires organizational change and adjustment. Internal tools are weak, and as the business grows, many DevOps tools can no longer meet the needs of the enterprise.

◾ During DevOps practice, the focus shifts. In organizational effectiveness, the focus is on how to achieve efficient and agile delivery. In architecture design, the focus is on how to clarify the dependencies between architecture components, deliver applications quickly, and carry out geo-redundant or multi-active migration. In self-service, more and more enterprises are choosing self-service tools. According to Gartner’s “China DevOps Research Report (2021)”, 75% of large enterprises will consider self-service the most important trend in DevOps applications by 2025.

◾ As DevOps evolves: More and more enterprises practicing DevOps are choosing intelligent decision-making capabilities, including assessment of DevOps capability maturity.

2. DevOps in Cloud trends

As enterprises move to the cloud, more and more of them have begun to practice DevOps on public clouds. This requires adapting applications for the cloud and combining cloud-native tools with task-flow orchestration to improve delivery efficiency.

In practicing DevOps on the cloud, many enterprises have completed the transformation to a microservice architecture and the upgrade to distributed applications, and service governance is becoming more and more mature. However, the surge in the number of applications and the growing complexity of dependencies that this architecture brings also pose great challenges to application observability and system stability.

During the transition of DevOps to the cloud, many companies have also turned their monolithic applications into services. Almost all enterprises believe that open APIs and as-a-service delivery are the core competitiveness of enterprise openness and servitization.

3. CloudOps is the new trend of cloud operation and maintenance

Based on the above trends of DevOps in the cloud, Alibaba Cloud Elastic Computing defined the CloudOps model, which combines the advantages of both DevOps and the cloud. The benefits can be seen in four dimensions: cost, delivery speed, flexibility, and system reliability.

◾ Cost reduction: DevOps can greatly reduce costs through improved organizational effectiveness and the construction of digital tools, while the cloud can reduce resource and labor costs through on-demand elastic resources and a choice of resource types and payment methods.

◾ Delivery efficiency: DevOps can achieve CI/CD, while the cloud can deliver resources within seconds or minutes.

◾ Flexibility: Users place higher requirements on the application development and launch cycle, such as delivering an app in 7 days, from zero to launch in the app store. The cloud can also help customers achieve rapid delivery of resources for a variety of infrastructures.

◾ Reliability: DevOps puts the concept of automation into practice, while the cloud naturally provides highly available infrastructure.

From application high availability to the high availability of technical resources, and on to system monitoring and insight capabilities, DevOps and the cloud complement each other very well. We therefore propose a new concept on the cloud, CloudOps, which fully combines the advantages of the cloud and DevOps to achieve a 1+1>2 effect.

02 Application-centric automated operation and maintenance

The core concept of CloudOps is application-centric, because it is the application that matters most to the customer.

Over the entire life cycle of an application, from construction to delivery, the customer’s focus changes. First comes the construction and delivery of the application, and how to make delivery automatic and agile. After delivery, customers are concerned about the reliability of the system; one strategy that can quickly improve availability is elasticity, combined with high-availability solutions to upgrade the system architecture. Once the application goes online, customers pay more attention to security compliance and auditing after release. As applications grow larger, customers focus on cost, which completes a cycle of constant iteration and improvement.

1. Application automation trilogy

Automation is the basis of system upgrade and transformation. Application automation consists of three main parts: infrastructure automation, O&M automation, and service automation.

1. Infrastructure automation: In the past year, Alibaba Cloud has released many products to simplify infrastructure automation. Many enterprises have started to implement automation, but the problem is that their automation templates have to be run by the customers themselves. Today Alibaba Cloud can take these templates without any modification and execute them directly on our engine. At the same time, more and more enterprises are reluctant to use JSON or YAML to define their infrastructure, and our new product ROS CDK, released today, is a good solution to this problem.

In addition, to simplify automated delivery, we also provide resource migration tools and automatic image building, so customers can build an ECS image the way they build a container image. We will also define image families so that users can always automatically select the latest version, as they would with container images, without updating configuration files.
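The "always select the latest version" behavior of an image family can be sketched in a few lines of Python. This is only an illustration of the idea, not the ECS API: the `Image` class, its fields, and the image IDs are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Image:
    image_id: str            # hypothetical identifier, not a real ECS image ID
    family: str              # name of the image family the image belongs to
    created_at: datetime     # build time of the image
    available: bool = True   # images that are not usable yet are skipped


def latest_in_family(images: list[Image], family: str) -> Optional[Image]:
    """Return the newest available image in the family, or None if there is none."""
    candidates = [img for img in images if img.family == family and img.available]
    return max(candidates, key=lambda img: img.created_at, default=None)


images = [
    Image("img-001", "web-base", datetime(2021, 10, 1)),
    Image("img-002", "web-base", datetime(2021, 12, 1)),
    Image("img-003", "web-base", datetime(2021, 12, 15), available=False),
    Image("img-101", "db-base", datetime(2021, 11, 1)),
]

selected = latest_in_family(images, "web-base")
print(selected.image_id if selected else "no image available")
```

Because callers refer to the family name rather than a fixed image ID, publishing a newer image changes what is deployed without any edit to configuration files.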

2. O&M automation: Our O&M orchestration service, OOS, opened a task market and released a large number of accumulated best practices and tools there for free, so that users can integrate and use them. At the same time, to make it easy to associate multiple applications, we also released application management.

3. Service automation: We have always regarded self-service problem discovery, troubleshooting, and problem solving as a main direction of our efforts.

2. New product: ROS Resource Migration

The first product is ROS Resource Migration. Many people think IaC (Infrastructure as Code) is a very good idea, but it is very challenging in practice. Writing IaC templates is difficult at first, because it requires a lot of complex domain knowledge and an understanding of scripting languages. Moreover, after a template is written, it must be continuously updated as the application architecture is upgraded so that it reflects the latest infrastructure.

To solve this problem, Alibaba Cloud provides a new solution based on its tagging function. After users tag their resources, ROS automatically analyzes the dependencies among the tagged resources and builds a set of IaC templates for them. In other words, users do not need to know IaC or write JSON or YAML, because Alibaba Cloud generates the templates automatically. After a template is generated, users can easily deploy it in multiple availability zones, multiple accounts, and multiple regions, which greatly reduces the complexity of building a set of infrastructure templates. Users can also configure and define the generated template as an intelligent template to help ensure that deployment succeeds.
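The tag-driven generation described above can be pictured with the following minimal sketch. It is not how ROS is implemented: the inventory format, the resource IDs, and the `build_template` helper are hypothetical. The sketch only shows the two steps named in the text, selecting resources by tag and ordering them by dependency.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory of existing resources: id -> type, tags, dependencies.
inventory = {
    "vpc-1": {"type": "VPC", "tags": {"app": "shop"}, "depends_on": []},
    "vsw-1": {"type": "VSwitch", "tags": {"app": "shop"}, "depends_on": ["vpc-1"]},
    "ecs-1": {"type": "Instance", "tags": {"app": "shop"}, "depends_on": ["vsw-1"]},
    "ecs-9": {"type": "Instance", "tags": {"app": "blog"}, "depends_on": []},
}


def build_template(inventory, tag_key, tag_value):
    """Collect the resources carrying the tag and emit them in dependency order."""
    selected = {
        rid: res for rid, res in inventory.items()
        if res["tags"].get(tag_key) == tag_value
    }
    # Keep only dependencies that are themselves part of the tagged set.
    graph = {
        rid: [dep for dep in res["depends_on"] if dep in selected]
        for rid, res in selected.items()
    }
    # static_order() yields every resource after the resources it depends on.
    order = TopologicalSorter(graph).static_order()
    return {
        "Resources": {
            rid: {"Type": selected[rid]["type"], "DependsOn": graph[rid]}
            for rid in order
        }
    }


template = build_template(inventory, "app", "shop")
for resource_id, body in template["Resources"].items():
    print(resource_id, body)
```

The same generated structure could then be deployed repeatedly, which is what makes multi-zone, multi-account, and multi-region rollout straightforward once the template exists.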

3. New capabilities: ROS CDK, ROS’s cloud development suite

In recent years, we have found that many enterprises are eager to embrace CloudOps but do not like JSON and YAML, so this year Alibaba Cloud also released a new capability: ROS CDK (Cloud Development Toolkit), ROS’s cloud development suite.

With ROS CDK, users write code in a higher-level language (such as Java or Python) to generate ROS templates, and ROS then creates the user’s infrastructure from those templates. To sum up, you can choose your own development language and a familiar programming model to implement Infrastructure as Code efficiently.
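To make the "code generates the template" idea concrete, here is a minimal sketch in plain Python. It deliberately does not use the ROS CDK library, and the `Stack` class and `ref` helper are hypothetical. The resource types, property names, and zone IDs are examples only and should be checked against the ROS documentation before use.

```python
import json


def ref(logical_id):
    """Build a reference to another resource in the same template."""
    return {"Ref": logical_id}


class Stack:
    """Hypothetical helper that collects resources and renders a template dict."""

    def __init__(self, description):
        self.description = description
        self.resources = {}

    def add(self, logical_id, resource_type, **properties):
        if logical_id in self.resources:
            raise ValueError(f"duplicate logical id: {logical_id}")
        self.resources[logical_id] = {"Type": resource_type, "Properties": properties}
        return logical_id

    def to_template(self):
        return {
            "ROSTemplateFormatVersion": "2015-09-01",
            "Description": self.description,
            "Resources": self.resources,
        }


stack = Stack("Demo network with one VSwitch per zone")
vpc = stack.add("Vpc", "ALIYUN::ECS::VPC", CidrBlock="192.168.0.0/16")

# A loop replaces what would be copy-pasted blocks in hand-written JSON or YAML.
for index, zone in enumerate(["cn-hangzhou-h", "cn-hangzhou-i"]):
    stack.add(
        f"VSwitch{index}",
        "ALIYUN::ECS::VSwitch",
        VpcId=ref(vpc),
        ZoneId=zone,
        CidrBlock=f"192.168.{index}.0/24",
    )

template = stack.to_template()
print(json.dumps(template, indent=2))
```

The benefit of this style is that loops, functions, and type checks from the host language become available when describing infrastructure, while the output remains an ordinary template.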

4. New tools: Application management

To simplify the construction of applications, Alibaba Cloud released application management. It is very simple to use: select a tag or import existing resources, and you can quickly build an application. With an application perspective, it can span multiple products and help users automate operations, monitoring, release, and CI/CD, greatly simplifying the overall O&M process and reducing costs.

In addition, the biggest challenge in running an application is upgrading it, including patch management, operating system configuration management, and so on. We help users group resources from the application perspective, which greatly lowers the barrier to carrying out these operations.

◾ Application reliability: After the application is built, the biggest challenge is reliability. Alibaba Cloud provides strong application reliability capabilities at the infrastructure level, such as multi-region deployment and multi-availability-zone deployment.

◾ Elastic fault tolerance: We built intelligent prediction, which can dynamically recommend the required resources according to users’ past utilization and operation of these resources. For transparency, we also opened up the ECS event system, which can be used for infrastructure fault-tolerance drills that simulate a physical machine going down or a disk I/O hang. We also provide application high availability services, which can simulate traffic protection and fault drills, greatly improving fault tolerance between systems.

◾ Observability construction: We have Cloud Monitor, SLS, ARMS, Xtrace, and other products, which provide full-link observation from basic resources to applications and logs to ensure system reliability.

◾ Data backup and recovery: We provide a fast snapshot capability that completes snapshot creation in seconds. This makes it very safe for users to make changes, without having to wait as long as before for a snapshot. Because using snapshots has a cost, we developed a new service called Snapshot Retention Cycle, with which users can automatically archive or delete unused snapshots to reduce that cost.
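The archive-or-delete decision behind a snapshot retention policy can be sketched as follows. This is a simplified illustration rather than the behavior of the actual service: the `Snapshot` class, the 30-day and 180-day thresholds, and the snapshot IDs are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str       # hypothetical identifier
    created_at: datetime
    in_use: bool           # True if something still depends on the snapshot


def plan_retention(snapshots, now,
                   archive_after=timedelta(days=30),
                   delete_after=timedelta(days=180)):
    """Map each snapshot id to 'keep', 'archive', or 'delete' (assumed thresholds)."""
    plan = {}
    for snap in snapshots:
        age = now - snap.created_at
        if snap.in_use:
            plan[snap.snapshot_id] = "keep"      # never touch snapshots still in use
        elif age >= delete_after:
            plan[snap.snapshot_id] = "delete"
        elif age >= archive_after:
            plan[snap.snapshot_id] = "archive"
        else:
            plan[snap.snapshot_id] = "keep"
    return plan


now = datetime(2021, 12, 21)
snapshots = [
    Snapshot("s-new", datetime(2021, 12, 10), in_use=False),
    Snapshot("s-mid", datetime(2021, 10, 1), in_use=False),
    Snapshot("s-old", datetime(2021, 1, 1), in_use=False),
    Snapshot("s-used", datetime(2020, 1, 1), in_use=True),
]

plan = plan_retention(snapshots, now)
for snapshot_id, action in plan.items():
    print(snapshot_id, action)
```

Separating the planning step from the execution step, as done here, makes it possible to review what would be archived or deleted before anything is changed.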

5. Security and compliance capability building

Security and compliance are also foundational capabilities of Alibaba Cloud and Elastic Computing. In addition to the basic platform (such as network security and system audit capabilities) and application security, we offer more capabilities today.

When a user modifies a security group and makes a noncompliant port change, the system automatically sends a warning to the user, helping to monitor unreasonable port changes and avoid system risks. In application security, in addition to the Cloud Security Center, the security of the operating system’s control channel is also a focus of our attention.
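A port-compliance check of the kind described above might look like the following sketch. The policy (only ports 80 and 443 may be open to the whole internet), the `IngressRule` class, and the warning format are assumptions for illustration, not the rules that the platform actually applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    port_start: int
    port_end: int
    source_cidr: str


# Assumed policy: only these ports may be reachable from any address.
ALLOWED_PUBLIC_PORTS = frozenset({80, 443})
OPEN_TO_INTERNET = "0.0.0.0/0"


def find_violations(rules):
    """Return one warning string per rule that exposes a disallowed port publicly."""
    warnings = []
    for rule in rules:
        if rule.source_cidr != OPEN_TO_INTERNET:
            continue  # rules restricted to specific sources are out of scope here
        exposed = set(range(rule.port_start, rule.port_end + 1)) - ALLOWED_PUBLIC_PORTS
        if exposed:
            warnings.append(
                f"ports {rule.port_start}-{rule.port_end} open to {rule.source_cidr}"
            )
    return warnings


rules = [
    IngressRule(443, 443, "0.0.0.0/0"),
    IngressRule(22, 22, "0.0.0.0/0"),
    IngressRule(3306, 3306, "10.0.0.0/8"),
]

for warning in find_violations(rules):
    print("WARNING:", warning)
```

Running such a check whenever a security group changes is what turns a silent misconfiguration into an immediate warning.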

When operating ECS, many people like to log in to the server over SSH or RDP. Cloud Assistant, provided by Alibaba Cloud, exposes basic APIs so that users can operate on an instance by sending requests, much as a browser does. Many users reported that this is less convenient and less friendly than SSH, so we released a new feature called Session Manager.

With Session Manager, you can manage and control hosts directly without user names or passwords, and you can integrate Session Manager into an existing system to perform keyless login, authentication, operation, and audit.

In addition, this year we also released a new function: interception of high-risk commands. When users execute high-risk commands, the commands can be intercepted, and the operations are added to the playback log. When a user performs a high-risk operation, Workbench records the screen and uploads the recording to OSS, which greatly improves security and the reliability of the auditable channel.
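The interception step can be illustrated with a small pattern-based filter. This sketch is not the product’s rule set: the three regular expressions, the `check_command` function, and the in-memory audit log are hypothetical, and a real implementation would need far more robust command parsing.

```python
import re

# Assumed examples of high-risk command patterns; real rule sets are broader.
HIGH_RISK_PATTERNS = {
    "recursive delete of /": re.compile(r"^\s*rm\s+-\w*r\w*\s+/\s*$"),
    "filesystem format": re.compile(r"^\s*mkfs(\.\w+)?\s"),
    "raw write to a block device": re.compile(r"^\s*dd\s+.*\bof=/dev/"),
}

# Every decision is recorded so that the session can be audited later.
audit_log = []


def check_command(command, user):
    """Return True if the command may run; record the decision either way."""
    for reason, pattern in HIGH_RISK_PATTERNS.items():
        if pattern.search(command):
            audit_log.append(
                {"user": user, "command": command, "blocked": True, "reason": reason}
            )
            return False
    audit_log.append(
        {"user": user, "command": command, "blocked": False, "reason": None}
    )
    return True


for command in ["ls -l /var/log", "rm -rf /", "dd if=/dev/zero of=/dev/vda"]:
    verdict = "allowed" if check_command(command, user="alice") else "blocked"
    print(f"{verdict}: {command}")
```

Logging allowed commands as well as blocked ones is a deliberate choice here, since an audit trail is only useful if it is complete.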

From the application perspective, users have a big headache trying to figure out how the configurations of two ECS instances differ and why some machines have problems while others do not. In the past this was very difficult to analyze. With the ECS instance configuration inventory, we capture configuration information such as the Windows registry and help users take a snapshot of it. After the snapshots are complete, the differences between the two machines are analyzed automatically, so users can quickly find them, which greatly reduces troubleshooting time.
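Comparing two configuration snapshots comes down to a key-by-key diff. The sketch below assumes that each snapshot has been flattened into a dictionary of string keys and values. The key names and values are invented for the example and do not reflect the actual inventory format.

```python
MISSING = "<missing>"  # placeholder for a key that exists on only one machine


def diff_config(left, right):
    """Return {key: (left_value, right_value)} for every key whose values differ."""
    differences = {}
    for key in sorted(left.keys() | right.keys()):
        left_value = left.get(key, MISSING)
        right_value = right.get(key, MISSING)
        if left_value != right_value:
            differences[key] = (left_value, right_value)
    return differences


# Hypothetical flattened snapshots of a healthy and a faulty instance.
healthy = {
    "os.kernel": "5.10.23",
    "sysctl.net.core.somaxconn": "4096",
    "pkg.openssl": "1.1.1k",
}
faulty = {
    "os.kernel": "5.10.23",
    "sysctl.net.core.somaxconn": "128",
    "pkg.nginx": "1.20.1",
}

differences = diff_config(healthy, faulty)
for key, (left_value, right_value) in differences.items():
    print(f"{key}: {left_value} -> {right_value}")
```

Only the keys that differ are reported, so an engineer can go straight to the handful of settings that distinguish the failing machine.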

We have long pursued centralized configuration management, so we released management of key parameters for ECS. Customers can put application parameters into the Parameter Store, which is natively supported by resource orchestration, O&M orchestration, and other products on the cloud, so that centralized parameter management avoids configuration problems. With the Parameter Store, users can also audit parameters.
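The combination of centralized parameters and auditing can be pictured with a toy in-memory store. This is a conceptual sketch only and not the Parameter Store API: the `ParameterStore` class, its methods, and the parameter names are hypothetical.

```python
from datetime import datetime, timezone


class ParameterStore:
    """Toy in-memory store that versions values and audits every access."""

    def __init__(self):
        self._versions = {}   # parameter name -> list of values, oldest first
        self.audit_log = []   # one entry per put or get

    def _record(self, action, name, actor):
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "name": name,
            "actor": actor,
        })

    def put(self, name, value, actor):
        """Store a new version of the parameter and return its version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(value)
        self._record("put", name, actor)
        return len(versions)

    def get(self, name, actor):
        """Return the latest value; raises KeyError if the parameter is unknown."""
        self._record("get", name, actor)
        return self._versions[name][-1]


store = ParameterStore()
store.put("shop/db_host", "10.0.0.5", actor="deploy-bot")
store.put("shop/db_host", "10.0.0.6", actor="deploy-bot")
print(store.get("shop/db_host", actor="ops-alice"))
```

Because every application reads the same named parameter instead of keeping its own copy, a change is made once, and the audit log shows who changed or read it.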

These new capabilities greatly simplify ECS O&M, provide secure channels, and achieve centralized configuration management.

03 CloudOps (Automated O&M on the Cloud) White paper published

1. DevOps in Cloud ≠ CloudOps

Is using DevOps in the cloud the same as CloudOps? Probably not. According to the latest DevOps report for 2021, only 20 percent of enterprises are actually taking full advantage of DevOps in the cloud, because of the huge differences between on-cloud and off-cloud environments.

◾ First, there are differences in operating methods. The cloud provides many free automated O&M tools and integration tools, which can greatly reduce users’ costs, but users need to integrate them with their existing tools.

◾ Second, there is the shift from assets to resources. What is managed as an asset off the cloud is managed as a resource on the cloud. For example, when managing resources on the cloud, a configuration or application upgrade can be completed by releasing the original machine and launching a new one, without concern for the machine as an asset. This is the difference between the on-cloud and off-cloud operation modes.

◾ Third, there is the difference in uniformity and scale. The scale of the cloud is very large, and a large number of machines can be created or released at any time. A wrong operation may bring relatively large cost or technical risk to the enterprise.

◾ Finally, the real-time requirements for security and auditing are very high on the cloud.

2. CloudOps primary maturity model and white paper

In our view, CloudOps is not just DevOps on the cloud. It also asks users to pay attention to the characteristics of the cloud. We summarize these characteristics in five dimensions: automation capability, elasticity capability, reliability capability, security and compliance capability, and cost and resource quantification. We have defined these five domains of DevOps on the cloud in detail and graded each of them to form the CloudOps primary maturity model.

Take automation as an example: the prevailing view is that the goal is unattended operation, and this is how the CloudOps primary maturity model defines it. With this maturity model, we hope to help customers gauge whether their DevOps practice in the cloud is mature enough and how they can improve it.

To help customers better understand our CloudOps maturity model, we have published the CloudOps white paper and the CARES model, co-written by more than 10 technical experts from Alibaba Cloud Elastic Computing. The CARES model covers five aspects: cost management, automation, reliability, elastic capacity management, and security compliance. It shows how to find the right operation mode and operation tools on the cloud.

3. The Alibaba Cloud CloudOps product family

Many people say that the essence of cloud computing is the automation of O&M capabilities. Over the past decade, Alibaba Cloud Elastic Computing has built many tools and made many efforts to simplify O&M, aiming to comprehensively improve DevOps efficiency on the cloud, and has formed a complete CloudOps product family.

◾ Cost management: cost optimization schemes and payment mode schemes can greatly reduce user costs.

◾ Managed O&M, including O&M orchestration, patch management, configuration inventory, and the Parameter Store.

◾ Batch delivery: tools such as OpenAPI and elastic scaling can greatly reduce the complexity of automated delivery.

◾ The instance operations channel gives users a wide range of ways to integrate, through our web version as well as Cloud Assistant and newly released tools, greatly lowering the barrier to automated operations.

◾ Reliability is what all cloud users are looking for, and we have released application management capabilities.

◾ We also provide a full suite of observability, self-service troubleshooting, and event services, most of which are free.

◾ Security compliance, including convenient security and compliance audits in the application environment. We integrate many products to improve our overall security compliance capabilities and help customers identify and eliminate security compliance risks in a timely manner.

Alibaba Cloud Elastic Computing has been committed to providing customers with rich, secure, and convenient cloud O&M products and capabilities, from the early days of the cloud to today’s era of making good use of the cloud and managing it well. In the future, we hope to work with you to build more efficient and intelligent cloud O&M.

This article is the original content of Aliyun and shall not be reproduced without permission.