Authors: Zhimin, Juyuan

Looking back at 2021, what are some of the most significant events in the cloud native space?

1. Accelerated adoption of container-based distributed cloud management:

At the Ali Cloud Summit in May 2021, Ali Cloud released the “One Cloud, Multiple Forms” deployment model: a single cloud based on the Apsara (Feitian) architecture can cover computing scenarios ranging from core regions all the way to customer data centers, providing customers with low-cost, low-latency, localized public cloud products.

Before the release of “One Cloud, Multiple Forms”, Ali Cloud Container Service had already launched the registered-cluster capability at the 2019 Apsara Conference, supporting unified management of Kubernetes clusters running both on and off the cloud. In 2021, Ali Cloud Container Service was further upgraded to uniformly manage container clusters across the central cloud, local clouds, and the edge cloud, bringing mature cloud native observability and security protection capabilities into customer environments, and pushing advanced cloud capabilities such as middleware, data analytics, and AI down to local environments, meeting customers' demands for product richness and data control and accelerating business innovation. Backed by the strong elastic computing power of the cloud and managed elastic nodes, enterprises can expand capacity from local environments to the cloud on demand, achieving second-level scaling and calmly handling periodic or unexpected traffic peaks.

By 2021, building distributed cloud architectures on Kubernetes to shield applications from the differences between heterogeneous environments had become a consensus among enterprises and cloud vendors.

2. Official release of Knative 1.0:

As an open source Serverless orchestration framework built on Kubernetes, Knative provides Serverless application orchestration through Kubernetes-standard APIs. Knative supports many features: traffic-based autoscaling, grayscale releases, multi-version management, scale to zero, event-driven Eventing, and more. According to the CNCF 2020 China Cloud Native Survey, Knative has become the first choice for running Serverless workloads on Kubernetes.
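To make these features concrete, here is a minimal sketch, using the Knative Serving v1 Go types, of a Service that scales to zero when idle and splits traffic between a pinned stable revision and the latest one for a grayscale release. The service name, image, and revision name are hypothetical, not taken from any real deployment.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	servingv1 "knative.dev/serving/pkg/apis/serving/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	latest, pctStable, pctCanary := true, int64(90), int64(10)
	svc := servingv1.Service{
		TypeMeta:   metav1.TypeMeta{APIVersion: "serving.knative.dev/v1", Kind: "Service"},
		ObjectMeta: metav1.ObjectMeta{Name: "hello", Namespace: "default"},
		Spec: servingv1.ServiceSpec{
			ConfigurationSpec: servingv1.ConfigurationSpec{
				Template: servingv1.RevisionTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{
						Annotations: map[string]string{
							// Scale to zero when idle; cap scale-out at 10 replicas.
							"autoscaling.knative.dev/minScale": "0",
							"autoscaling.knative.dev/maxScale": "10",
						},
					},
					Spec: servingv1.RevisionSpec{
						PodSpec: corev1.PodSpec{
							Containers: []corev1.Container{{
								Image: "registry.example.com/demo/hello:v2", // hypothetical image
							}},
						},
					},
				},
			},
			RouteSpec: servingv1.RouteSpec{
				// Grayscale release: 90% to the pinned stable revision,
				// 10% to whatever revision is latest.
				Traffic: []servingv1.TrafficTarget{
					{RevisionName: "hello-00001", Percent: &pctStable},
					{LatestRevision: &latest, Percent: &pctCanary},
				},
			},
		},
	}
	out, _ := yaml.Marshal(svc)
	fmt.Print(string(out))
}
```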

Knative released version 1.0 in November 2021, the same month that Google announced it was donating Knative to the Cloud Native Computing Foundation (CNCF). Ali Cloud provides a hosted Knative offering, adding enhancements such as cold-start optimization and prediction-based intelligent autoscaling on top of Ali Cloud infrastructure, deeply integrating community standards with the advantages of cloud services.

What container technology breakthroughs happened in 2021? What problems do they solve?

In 2021, enterprises embraced containers more actively, raising their requirements on the startup efficiency, resource cost, and scheduling efficiency of core container technology. The Ali Cloud container team supported the upgrade to a new generation of container architecture, and continued to tap the potential of containers through full-stack optimization spanning containers, bare metal, operating systems, and other technologies.

Efficient scheduling: The newly upgraded Cybernetes scheduler supports NUMA load awareness, topology scheduling, and fine-grained resource isolation and colocation for the multi-architecture Shenlong (X-Dragon) platform, improving application performance by 30%. In addition, extensive end-to-end optimization of the scheduler delivers a scheduling throughput of more than 20,000 Pods/min in a 1,000-node cluster, ensuring that both online services and offline tasks run efficiently on K8s.

High-performance container networking: The latest generation of the Ali Cloud container network, Terway 3.0, offloads virtualization network overhead to dedicated hardware on the one hand, and implements container service forwarding and network policy with eBPF in the OS kernel on the other, truly achieving zero loss and high performance.

Container-optimized OS: LifseaOS is a lightweight, fast, secure, atomically image-managed operating system built for container scenarios. Compared with a traditional operating system, it reduces the number of software packages by 60% and the image size by 70%, and cuts first boot time from more than 1 minute to around 2 seconds. It supports read-only images and OSTree technology to version OS images, so software packages and hardened configurations on the operating system are updated at the granularity of the entire image.

Extreme elasticity with high-density deployment: Based on Ali Cloud Security Sandbox 2.0, per-sandbox resource overhead is reduced to as little as about 30 MB, enabling a high-density capability of 2,000 instances on a single physical machine. At the same time, by shortening management links, simplifying components, and optimizing the sandbox memory allocation process, host cgroup management, and the I/O path, Serverless scenarios achieve the elasticity of launching 3,000 elastic container instances in 6 seconds.

What are the trends in enterprise container adoption at scale? What are the core demands?

As enterprises adopt containers at larger scale, internal container usage is gradually expanding from online services to AI and big data, with more and more requirements for managing heterogeneous resources such as GPUs and for handling AI tasks and jobs. At the same time, developers are considering how to use cloud native technology to support more types of workloads with a unified architecture and a unified technology stack, avoiding the “smokestack” systems, duplicated investment, and O&M burden caused by running different workloads on different architectures and stacks.

Deep learning and AI tasks are among the important workloads for which the community seeks support from cloud native technology. At Ali Cloud, we proposed a definition of cloud native AI together with a technology panorama and reference architecture, providing best practices for landing this new technology, and launched a cloud native AI suite. Through task orchestration, data management, and unified scheduling and operation of all kinds of heterogeneous computing resources in containers, it improves the resource utilization efficiency of heterogeneous computing clusters such as GPU and NPU and the delivery speed of AI projects.

Based on the core Kubernetes Scheduler Framework, the suite makes extensive extensions and enhancements for the characteristics of AI computing tasks, supporting task scheduling policies such as gang scheduling, capacity scheduling, and binpacking to improve cluster resource utilization. We also cooperate actively with the K8s community to continuously push forward the evolution of the scheduler framework, ensuring that the K8s scheduler can be extended with whatever scheduling strategies a workload needs through the standard plugin mechanism, while avoiding the risk of inconsistent cluster resource accounting that additional custom schedulers can introduce.
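To make the plugin mechanism concrete, here is a minimal sketch of a Score plugin against the upstream Scheduler Framework interfaces. The binpacking heuristic and the plugin name are illustrative assumptions, not the actual implementation of the suite described above.

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// BinpackScore is a hypothetical Score plugin that favors nodes that are
// already heavily allocated, packing pods tightly to free up whole nodes.
type BinpackScore struct {
	handle framework.Handle
}

var _ framework.ScorePlugin = &BinpackScore{}

// New is the plugin factory the scheduler calls at startup.
func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &BinpackScore{handle: h}, nil
}

func (b *BinpackScore) Name() string { return "BinpackScore" }

// Score rates a node: the more CPU already requested on it, the higher the score.
func (b *BinpackScore) Score(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := b.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(err)
	}
	allocatable := nodeInfo.Allocatable.MilliCPU
	if allocatable == 0 {
		return 0, nil
	}
	// Scale utilization into the framework's [0, MaxNodeScore] range.
	return framework.MaxNodeScore * nodeInfo.Requested.MilliCPU / allocatable, nil
}

func (b *BinpackScore) ScoreExtensions() framework.ScoreExtensions { return nil }
```

In upstream Kubernetes, such a plugin is compiled into a scheduler binary with app.NewSchedulerCommand(app.WithPlugin(...)) and enabled through a scheduler profile, which is what keeps all scheduling decisions flowing through a single, consistent view of cluster resources.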

The suite supports GPU sharing scheduling, topology-aware scheduling, and scheduling of custom chips such as NPUs and FPGAs to improve resource utilization for AI tasks. At the same time, Ali Cloud's self-developed cGPU solution isolates GPU memory and computing power without requiring any modification to application containers.
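With GPU sharing scheduling, for instance, a pod requests a slice of GPU memory through an extended resource rather than a whole device. The sketch below assumes the aliyun.com/gpu-mem resource name used by Ali Cloud's open source GPU share scheduler extender (units in GiB); the pod name and image are hypothetical.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-share-demo"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "registry.example.com/ai/trainer:latest", // hypothetical image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Ask for 3 GiB of GPU memory rather than a whole GPU;
						// several such pods can then share one physical device.
						"aliyun.com/gpu-mem": resource.MustParse("3"),
					},
				},
			}},
		},
	}
	out, _ := yaml.Marshal(pod)
	fmt.Print(string(out))
}
```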

Driven by the separation of computing and storage, Fluid provides an efficient and convenient data abstraction layer: it abstracts data away from the underlying storage and fuses data with computation through data-affinity scheduling and distributed cache engine acceleration, thereby speeding up access to data from compute workloads. Alluxio and JindoFS are supported as cache engines.
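As a sketch of how this looks to a user, the following declares a Fluid Dataset and a matching AlluxioRuntime as unstructured Kubernetes objects, assuming Fluid's data.fluid.io/v1alpha1 API group; the dataset name and the OSS mount point are hypothetical.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/yaml"
)

func main() {
	// A Dataset tells Fluid where the underlying data lives.
	dataset := unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "data.fluid.io/v1alpha1",
		"kind":       "Dataset",
		"metadata":   map[string]interface{}{"name": "imagenet"},
		"spec": map[string]interface{}{
			"mounts": []interface{}{map[string]interface{}{
				"mountPoint": "oss://example-bucket/imagenet", // hypothetical bucket
				"name":       "imagenet",
			}},
		},
	}}
	// An AlluxioRuntime provisions the distributed cache for that Dataset.
	runtime := unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "data.fluid.io/v1alpha1",
		"kind":       "AlluxioRuntime",
		"metadata":   map[string]interface{}{"name": "imagenet"}, // must match the Dataset
		"spec":       map[string]interface{}{"replicas": int64(2)},
	}}
	for _, obj := range []unstructured.Unstructured{dataset, runtime} {
		out, _ := yaml.Marshal(obj.Object)
		fmt.Printf("---\n%s", out)
	}
}
```

Pods that mount the resulting PVC are scheduled near the cache workers, which is what turns data affinity into faster training and inference jobs.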

The suite also supports elastic scaling of heterogeneous resources such as GPUs, using intelligent peak shaving to avoid unnecessary resource consumption on the cloud, and supports both elastic model training and elastic model inference, as sketched below.
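Here is a minimal sketch of one way to express such elasticity declaratively, using the standard autoscaling/v2beta2 HPA API of that era to scale an inference Deployment on a per-pod GPU metric. The Deployment name, the gpu_duty_cycle metric (which would have to be exported through a custom metrics adapter), and the threshold are all hypothetical.

```go
package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2beta2"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	minReplicas := int32(1)
	hpa := autoscalingv2.HorizontalPodAutoscaler{
		TypeMeta:   metav1.TypeMeta{APIVersion: "autoscaling/v2beta2", Kind: "HorizontalPodAutoscaler"},
		ObjectMeta: metav1.ObjectMeta{Name: "inference-hpa"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1", Kind: "Deployment", Name: "model-inference", // hypothetical
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 10,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.PodsMetricSourceType,
				Pods: &autoscalingv2.PodsMetricSource{
					// Hypothetical per-pod metric from a GPU metrics exporter.
					Metric: autoscalingv2.MetricIdentifier{Name: "gpu_duty_cycle"},
					Target: autoscalingv2.MetricTarget{
						Type:         autoscalingv2.AverageValueMetricType,
						AverageValue: resource.NewQuantity(80, resource.DecimalSI),
					},
				},
			}},
		},
	}
	out, _ := yaml.Marshal(hpa)
	fmt.Print(string(out))
}
```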

What are the new requirements for container applications in the enterprise?

With the development of 5G, IoT, audio and video, live streaming, CDN, and other industries and services, we see an industry trend: enterprises are beginning to push more computing power and services closer to data sources and end users, in order to achieve better response times and lower costs.

This is clearly different from the traditional centralized cloud computing model, and it extends computing to the edge. As an extension of cloud computing, edge computing will be widely used in hybrid cloud/distributed cloud, IoT, and other scenarios. It requires decentralized infrastructure, autonomous edge facilities, and strong cloud-hosted management of the edge. A new frontier of cloud native architecture, a “cloud-edge-end integrated” IT infrastructure, is beginning to appear before the whole industry; it is also what enterprises now demand of cloud native technology and container applications in these new scenarios.

A cloud native architecture and technology system for edge computing needs to solve problems such as cloud-edge O&M collaboration, elasticity collaboration, network collaboration, edge IoT device management, lightweight footprint, and cost optimization. In response to the new demands of cloud-edge-end integration, the OpenYurt community (a CNCF Sandbox project) released versions 0.4 and 0.5 in 2021, continuously improving edge containers' capabilities in IoT device management, resource overhead, network collaboration, and more.

From a technology perspective, what are the main problems that container development needs to solve?

As enterprises use K8s applications at large scale in production, continuously improving the overall stability of K8s clusters becomes the core challenge. A K8s cluster is a highly complex distributed system, and a problem in the application, the infrastructure, or the deployment process can all cause the business system to fail. This requires not only a high-availability guarantee system built on cloud native container technology, but also an overall upgrade of the enterprise's cloud native operations philosophy.

SLO-driven observability: build a regular performance stress-testing capability matched to the capacity scale of K8s. This requires a clear understanding of the business scenarios running on the K8s cluster, including the number of nodes, pods, and jobs and the QPS of core verbs. Derive SLOs from real business scenarios, and keep an eye on the golden signals: request volume, latency, error count, and saturation.
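As a minimal sketch of instrumenting these golden signals, the following uses Prometheus's Go client to track request volume, errors (via the status-code label), and latency for a toy HTTP handler; the metric names are hypothetical, and saturation is usually taken from node and cgroup metrics rather than from the application itself.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Request volume and error count, partitioned by HTTP status code.
	requests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "app_requests_total",
		Help: "Total requests, partitioned by status code.",
	}, []string{"code"})

	// Latency distribution, usable for percentile-based SLOs.
	latency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "app_request_duration_seconds",
		Help:    "Request latency in seconds.",
		Buckets: prometheus.DefBuckets,
	})
)

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		timer := prometheus.NewTimer(latency)
		defer timer.ObserveDuration()
		requests.WithLabelValues("200").Inc()
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", nil)
}
```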

Regular fault drills and chaos testing: For example, ChaosBlade, built on chaos engineering concepts, injects different failure cases for different risky operations in a container cluster, covering fault simulation at every layer from VM, K8s, network, and storage up to the application.

Fine-grained flow control and risk control: build protection against the anomalies uncovered by stress testing and fault drills. Kubernetes promoted the fine-grained API Priority and Fairness (APF) flow control mechanism to beta in 1.20, and Ali Cloud Container Service has additionally built a self-developed UserAgentLimiter to further protect K8s.
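To illustrate what APF's fine-grained flow control looks like, here is a minimal sketch of a FlowSchema that routes list-heavy traffic from one service account into a lower priority level. It assumes an existing PriorityLevelConfiguration named workload-low; the schema name, namespace, and service account are hypothetical.

```go
package main

import (
	"fmt"

	flowcontrolv1beta1 "k8s.io/api/flowcontrol/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	fs := flowcontrolv1beta1.FlowSchema{
		TypeMeta:   metav1.TypeMeta{APIVersion: "flowcontrol.apiserver.k8s.io/v1beta1", Kind: "FlowSchema"},
		ObjectMeta: metav1.ObjectMeta{Name: "batch-list-heavy"},
		Spec: flowcontrolv1beta1.FlowSchemaSpec{
			// Send matching traffic to an existing, lower priority level.
			PriorityLevelConfiguration: flowcontrolv1beta1.PriorityLevelConfigurationReference{
				Name: "workload-low",
			},
			MatchingPrecedence: 1000,
			DistinguisherMethod: &flowcontrolv1beta1.FlowDistinguisherMethod{
				Type: flowcontrolv1beta1.FlowDistinguisherMethodByUserType,
			},
			Rules: []flowcontrolv1beta1.PolicyRulesWithSubjects{{
				Subjects: []flowcontrolv1beta1.Subject{{
					Kind: flowcontrolv1beta1.SubjectKindServiceAccount,
					ServiceAccount: &flowcontrolv1beta1.ServiceAccountSubject{
						Namespace: "batch", Name: "list-heavy", // hypothetical SA
					},
				}},
				ResourceRules: []flowcontrolv1beta1.ResourcePolicyRule{{
					Verbs:      []string{"list", "watch"},
					APIGroups:  []string{""},
					Resources:  []string{"pods"},
					Namespaces: []string{"*"},
				}},
			}},
		},
	}
	out, _ := yaml.Marshal(fs)
	fmt.Print(string(out))
}
```

The apiserver then queues and throttles this traffic separately, so a burst of expensive list requests cannot starve cluster-critical controllers.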

In addition to building global high-availability capabilities, the platform capabilities of the SRE team must also be built up:

Create a unified K8s operations service interface that consolidates operation and observation capabilities, so that any SRE or developer can go on call or provide support interchangeably. There are two sub-goals: 1) avoid problems as far as possible; 2) find, locate, and recover from problems as quickly as possible, building a global high-availability emergency response system.

Practice and drills: practice is scenario-based, uniting knowledge with action in a closed loop that runs from trigger to resolution and then repeats continuously. Train through real events: extreme scenarios such as the Double Eleven shopping festival, power limits, and network outages all demand dedicated stability work, capacity planning and stress testing, and component governance. With these arenas to fight in, teams must work together and gradually form a large-scale collaboration mechanism.

Solidify knowledge into playbooks: this is about creating standards. In the process of standardization, some knowledge should be built into the platform, some into playbooks, and some into processes; the processes must capture the best practices of our best engineers and SREs. Platform, playbooks, and processes then continuously transform into and reinforce one another.

What will be the focus of container technology in 2022? What are the possibilities for the future of containers?

Forrester recently released The Forrester Wave™: Public Cloud Container Platforms, Q1 2022, a global report on container capabilities. According to the report, Ali Cloud is the only Chinese service provider in the “Leaders” quadrant, and received the highest composite capability score for its container products.

Ali Cloud Container Technology will focus on several directions in 2022:

Green and low-carbon: Continue to leverage the efficient scheduling and elasticity of container technology to help enterprises improve overall IT efficiency. Combine the latest energy-saving data center technology, the new generation of the DPCA architecture, self-developed chips, and container-optimized operating systems to achieve full-stack optimization up and down the stack, improving overall application performance and scheduling efficiency. In a data-driven way, realize intelligent scheduling and real-time adjustment based on runtime resource profiles of applications, simplifying the complexity of application resource configuration, further improving colocated deployment of applications, reducing resource costs, and supporting enterprise-wide FinOps management.

AI engineering: For AI to become enterprise productivity, engineering must solve the whole-lifecycle management problems of model development, deployment, management, prediction, and inference. We see three things that need to be done in AI engineering: making data and computing power cloud native, scaling up scheduling and programming paradigms, and standardizing and popularizing development and services. We need to continuously optimize the efficient scheduling of heterogeneous architectures such as GPUs, and comprehensively upgrade AI engineering capabilities by combining distributed caching and distributed dataset acceleration technologies with the AI task pipelines and lifecycle management of Kubeflow Arena.

Intelligent autonomy: Introduce more data-driven intelligence to advance an intelligent container operations system, reducing the burden of managing complex container clusters and applications, enhancing the self-healing and self-recovery capabilities of K8s masters, components, and nodes, and providing friendlier capabilities such as anomaly diagnosis, K8s configuration recommendation, and elasticity prediction.

Security compliance: Promote the shift from DevOps to DevSecOps. Optimize overall security definition, signing, synchronization, and third-party delivery for OCI Artifacts such as Helm charts and Operators; harden north-south and east-west container network isolation and governance to advance zero-trust link security; and further improve the performance and observability of secure containers and confidential computing containers.

Click here to visit the official Aliyun ACK Anywhere website.