Authors | Sun Jianbo (Alibaba technical expert), Zhao Yuying

Introduction: In the cloud native era, Kubernetes is becoming increasingly important. However, most Internet companies have not been as successful with Kubernetes as they expected, and its inherent complexity is enough to deter some developers. In this interview, Alibaba technical expert Sun Jianbo draws on Alibaba’s Kubernetes application management practice to offer experience and suggestions that may help developers.

In the Internet era, developers mostly relied on top-level architecture design, such as multi-cluster deployment and distributed architecture, to fail over quickly when resource-related problems occurred. They did a great deal of work to make elasticity easier and improved resource utilization by co-locating different computing tasks. The emergence of cloud computing enabled the shift from CAPEX to OPEX.

The cloud computing era lets developers focus on the value of the application itself. In the past, developers had to put a lot of effort into storage, networking, and other infrastructure; today that infrastructure is as convenient to use as water, electricity, and gas. Cloud infrastructure is stable, highly available, and elastically scalable. It also addresses a series of application development “best practice” concerns, such as monitoring, auditing, log analysis, and canary (gray) releases. An engineer once needed very broad expertise to build a highly reliable application. Now, anyone who knows the infrastructure products well enough has these best practices at their fingertips. However, many developers feel helpless in the face of the inherent complexity of Kubernetes.

Nick Young, lead engineer on the Kubernetes team at Atlassian, the company behind Jira and the Bitbucket code hosting service, said in an interview:

“While the Kubernetes strategy was the right one (at least so far, no alternative has been identified) and solved many of the problems encountered at this stage, the deployment process has been extremely difficult.”

So, is there a good solution?

Kubernetes is too complicated

“If I had to name one problem with Kubernetes, it would of course be that it is ‘too complicated,’” Sun said in the interview. “But that is really a consequence of how Kubernetes is positioned.”

Kubernetes is positioned as a “platform for platforms”, Sun added. Its direct users are neither application developers nor application operators but “platform builders”, that is, infrastructure or platform engineers. For a long time, however, the Kubernetes project has often been misused: many application operators, and even application developers, work directly against very low-level Kubernetes APIs. This is one of the root causes of the complaint that Kubernetes is too complicated.

It is as if a Java web engineer had to call Linux kernel system calls directly to deploy and manage business code and, naturally, found Linux “anti-human”. In other words, the Kubernetes project currently lacks a higher-level encapsulation that would make it friendlier to the software developers and operators further up the stack.

Once this positioning is understood, it makes sense that Kubernetes designs its API objects as “all-in-one”, just like the Linux kernel API, with no distinction between users. However, when developers actually want to manage applications on K8s and work with R&D and operations engineers, they have to confront this problem and find a standard, unified way to solve it, much like adding another layer on top of the Linux kernel API. This is why Alibaba Cloud and Microsoft jointly released the Open Application Model (OAM), an open cloud native application model.

Stateful application support

In addition to its inherent complexity, Kubernetes’ support for stateful applications is a problem that many developers have spent a lot of time researching and trying to solve. Stateful applications can be supported, but there is no clearly superior solution. The mainstream industry approach today is the Operator, but writing an Operator is actually very difficult.

In the interview, Sun Jianbo explained that an Operator is essentially an “advanced version” of a K8s client, while the K8s API Server is designed around a “heavy client” model, which of course keeps the API Server itself simple. As a result, both the K8s client library and the Operators built on it have become extremely complex and hard to understand: they are entangled with many implementation details of K8s itself, such as the Reflector, the cache store, and the informer. None of these should concern Operator authors, who should be domain experts in the stateful application itself (TiDB engineers, for example), not K8s experts. This is currently the biggest pain point in K8s stateful application management, and addressing it may require a new Operator framework.
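To illustrate what “domain logic only” could look like, here is a minimal sketch of the comparison at the heart of an Operator: desired state versus observed state. It is plain Python with no Kubernetes client, and the names `ClusterState` and `reconcile` are hypothetical; a real Operator would also have to deal with the watch and cache machinery described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical, simplified view of a stateful application."""
    replicas: int
    version: str


def reconcile(desired: ClusterState, observed: ClusterState) -> List[str]:
    """Return the domain-level actions that move observed toward desired."""
    actions = []
    # Upgrade first so that any new replicas start on the desired version.
    if observed.version != desired.version:
        actions.append(f"upgrade {observed.version} -> {desired.version}")
    if observed.replicas < desired.replicas:
        actions.append(f"scale out by {desired.replicas - observed.replicas}")
    elif observed.replicas > desired.replicas:
        actions.append(f"scale in by {observed.replicas - desired.replicas}")
    return actions


if __name__ == "__main__":
    print(reconcile(ClusterState(5, "v2"), ClusterState(3, "v1")))
```

Everything in this function is something a database engineer could reason about without knowing how the K8s client works, which is the division of labor the paragraph argues for.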

On the other hand, supporting complex applications is not just a matter of writing Operators; it also depends on the underlying technology for delivering stateful applications, which today’s continuous delivery projects in the community neglect, intentionally or not. In fact, continuously delivering a stateful, Operator-based application is a technical challenge of a different order of magnitude from delivering a stateless K8s Deployment. This is an important reason why Sun Jianbo’s team advocates the “application delivery layered model” in CNCF SIG App Delivery. As shown in the figure below, the model’s four layers are “application definition”, “application delivery”, “application operation and automation”, and “platform layer”. Only when the capabilities of these four layers work together can stateful applications be delivered with high quality and efficiency.

For example, Kubernetes API objects are designed to be “all-in-one”, meaning that every participant in the application management process must collaborate on the same API object. As a result, an API object description such as a K8s Deployment contains fields that matter to application developers, fields that matter to operations, and fields that may matter to several parties at once.

In fact, application developers, application operators, and K8s automation capabilities such as HPA may all need to control the same field in an API object. The most typical case is the replicas field. Deciding who owns that field is tricky.
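The ownership problem can be shown with a small sketch. The `SharedSpec` class below is a hypothetical stand-in for an all-in-one object, not a Kubernetes API: any role may write any field, and each write silently replaces the previous owner’s value.

```python
class SharedSpec:
    """Hypothetical all-in-one object whose fields any role may write."""

    def __init__(self):
        self.fields = {}  # field name -> current value
        self.owners = {}  # field name -> role that wrote it last

    def write(self, role, field, value):
        """Set a field; return the role that was overridden, if any."""
        previous = self.owners.get(field)
        self.fields[field] = value
        self.owners[field] = role
        # Only report a conflict when a *different* role held the field.
        return previous if previous not in (None, role) else None


if __name__ == "__main__":
    spec = SharedSpec()
    print(spec.write("developer", "replicas", 3))
    print(spec.write("hpa", "replicas", 10))
    print(spec.write("operations", "replicas", 5))
    print(spec.fields["replicas"])
```

In this sketch the last writer always wins, so the autoscaler’s decision is lost as soon as operations applies its own value. Nothing in the object itself says which role should have had the final word.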

To sum up, since K8s is positioned as the Linux kernel of the cloud era, Kubernetes must keep making breakthroughs in Operator support, the API layer, and its various interface definitions, so that more ecosystem participants can build their own capabilities and value on top of K8s.

Alibaba’s large-scale Kubernetes practice

At present, Kubernetes is used across all of the Alibaba economy’s businesses, including e-commerce, logistics, and online computing, and it is one of the main forces supporting Alibaba’s major online promotions such as 618 and Double 11. Alibaba Group and Ant Financial operate dozens of very large K8s clusters internally, the largest of which has around 10,000 machine nodes, and that is not the maximum capacity. Each cluster serves tens of thousands of applications. On Alibaba Cloud Container Service for Kubernetes (ACK), we also maintain K8s clusters for tens of thousands of users, at a scale and level of technical challenge that is second to none in the world.

Sun Jianbo revealed that Alibaba began adopting containers as early as 2011. At that time it built containers on LXC technology, and later moved to a self-developed container technology and scheduling system. There was nothing wrong with that system in itself, but as the infrastructure technology team, the goal had to be a basic technology stack that could support a broader upper-layer ecosystem and keep evolving and upgrading. The whole team therefore spent more than a year gradually making up for K8s’ shortcomings in scale and performance. Overall, moving to K8s was a very natural process, and the practice itself is actually quite simple:

  • First, solve application containerization, which requires making sensible use of K8s container design patterns;
  • Second, solve application definition and description, which requires making sensible use of OAM, Helm, and other application definition tools and models, and connecting them to existing application management capabilities;
  • Third, build a complete application delivery chain, where you can consider adopting and integrating continuous delivery capabilities.

Once these three steps are complete, we are able to connect with R&D, operations, and the upper-layer PaaS, and can clearly explain the value of our platform. We can then start with pilots and replace the underlying infrastructure step by step without affecting the existing application management system.

Kubernetes itself does not provide a complete application management system; that system is built by the whole K8s-based cloud native ecosystem, as shown in the following figure:

Helm is one of the most successful examples and sits at the top of the application management system (layer 1). YAML management tools such as Kustomize and packaging tools such as CNAB correspond to layer 1.5. Then come Tekton, Flagger, Keptn, and other application delivery projects, which correspond to layer 2. Operators, together with K8s workload components such as Deployment and StatefulSet, correspond to layer 3. Finally, there are the core functions of K8s, which manage workload containers, encapsulate infrastructure capabilities, and provide APIs through which workloads connect to the underlying infrastructure.

Initially, the team’s biggest challenges were scale and performance bottlenecks, but these were also the most straightforward problems to solve. “As we scaled up, the biggest challenge we saw in large-scale K8s adoption was actually how to manage applications and connect with the upper-layer ecosystem on top of K8s,” Sun said. For example, hundreds of controllers built by dozens of teams for different purposes need unified control; production applications from different teams, with radically different release and scaling strategies, need to be delivered nearly 10,000 times a day; and dozens of more complex upper-layer platforms need to be supported, with different types of jobs co-scheduled and co-deployed to pursue the highest resource utilization. These are the demands Alibaba’s Kubernetes practice has to meet, and scale and performance are only one part of them.

In addition to the native functions of Kubernetes, Alibaba develops a large amount of infrastructure that connects to these functions in the form of K8s plug-ins. As scale grows, discovering and managing these capabilities in a unified way has become a key issue.

Alibaba also has a large number of existing PaaS platforms, built to meet the needs of users in different business scenarios. For example, some users want to upload a Java WAR package and run it, while others want to upload a container image and run it. Behind these requirements, Alibaba’s teams do a great deal of application management work on users’ behalf, which is why these existing PaaS platforms emerged, and connecting them to Kubernetes can raise various problems. At present, Alibaba is using OAM, a unified standard application management model, to help these PaaS platforms connect to and converge on a K8s foundation, achieving standardization and becoming cloud native.

Decoupling operations and R&D

Through decoupling, the Kubernetes project and the corresponding cloud service providers can expose declarative APIs of different dimensions to different roles, better matching each user’s needs. For example, an application developer only needs to declare in a YAML file that “Application A will use 5 GB of read/write space”, while application operations staff declare in the corresponding YAML file that “Pod A will mount a 5 GB read/write data volume”. This focus on “letting users care only about what they care about” is the key to lowering the learning barrier and difficulty for Kubernetes users.
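The two declarations in this example can be sketched as two views of the same requirement. The dictionaries and the `to_ops_view` function below are hypothetical and only mirror the wording above; they are not a Kubernetes or OAM schema, and the mount path is an assumed operations-side detail.

```python
def to_ops_view(dev_decl, mount_path):
    """Translate a developer-level storage request into a pod-level volume.

    dev_decl is what the developer writes; mount_path is supplied by
    operations, because where the volume is mounted is not the developer's
    concern in this sketch.
    """
    storage = dev_decl["storage"]
    return {
        "pod": dev_decl["application"],
        "volume": {
            "name": f'{dev_decl["application"]}-data',
            "sizeGB": storage["sizeGB"],
            "readOnly": storage["access"] != "read-write",
            "mountPath": mount_path,
        },
    }


if __name__ == "__main__":
    # "Application A will use 5 GB of read/write space"
    developer_view = {
        "application": "A",
        "storage": {"sizeGB": 5, "access": "read-write"},
    }
    # "Pod A will mount a 5 GB read/write data volume"
    print(to_ops_view(developer_view, mount_path="/data"))
```

The developer’s declaration never mentions pods, volumes, or mount paths, while the operations view carries all three. Each role edits only the part it cares about.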

Sun Jianbo says most current solutions are actually forms of “pessimistic handling”. For example, to reduce the burden on R&D, Alibaba’s internal PaaS platform has for a long time exposed only 5 Deployment fields to R&D. This is, of course, because the “all-in-one” design of K8s YAML makes the full YAML too complex for developers, but it also means that most of K8s’ own capabilities are completely invisible to them. For the operations staff of the PaaS platform, on the other hand, K8s YAML is too simple to describe the platform’s operational capabilities, so they have to add a large number of annotations to the YAML files.

Moreover, the core problem is that this “pessimistic approach” turns operations staff into “dictators” who handle a great deal of detail and do a lot of thankless work. For example, scaling strategy is now dictated entirely by operations. Yet R&D engineers, who actually write the code, have the most say in how an application should scale, and they very much want to communicate their views to operations so that K8s can scale more flexibly and truly meet the application’s needs. The current system does not allow this.

Therefore, “decoupling R&D and operations” does not mean separating the two; it means giving R&D a standard, efficient way to communicate with operations, which is the problem the OAM application management model sets out to solve. Sun Jianbo said that one of OAM’s main functions is to provide a set of standards and specifications that let R&D express their requirements from their own perspective, in a form that “you know, I know, and the system knows”, so that these problems can be solved.

Specifically, OAM is a standard specification that focuses on describing applications. With this specification, the application description can be completely separated from the details of infrastructure deployment and application management. The benefits of this separation of concerns are obvious. For example, in real production environments, operational concepts that appear consistent, whether Ingress, CNI, or Service Mesh, can vary greatly between Kubernetes clusters. By separating the application definition from the cluster’s operational capabilities, we let application developers focus on the value of the application itself rather than on operational details such as where the application is deployed.
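As a rough sketch of this separation, the same application definition can be bound to clusters whose operational capabilities are implemented differently. The `bind` function, the dictionary layout, and the capability names below are hypothetical simplifications for illustration and do not follow the OAM schema.

```python
def bind(app, cluster_capabilities):
    """Resolve each capability the app needs to one cluster's implementation.

    The application definition is read but never modified, so the same
    definition can be bound to any number of clusters.
    """
    missing = [n for n in app["needs"] if n not in cluster_capabilities]
    if missing:
        raise ValueError(f"cluster lacks capabilities: {missing}")
    return {
        "name": app["name"],
        "image": app["image"],
        "bindings": {n: cluster_capabilities[n] for n in app["needs"]},
    }


if __name__ == "__main__":
    # Written once by the application developer (hypothetical values).
    app = {"name": "web", "image": "example/web:1.0", "needs": ["ingress"]}

    # Maintained per cluster by operations (hypothetical values).
    cluster_a = {"ingress": "nginx-ingress", "network": "cni-plugin-x"}
    cluster_b = {"ingress": "mesh-gateway", "network": "cni-plugin-y"}

    print(bind(app, cluster_a))
    print(bind(app, cluster_b))
```

The developer states only that the application needs an ingress. Which ingress implementation serves it is decided per cluster, outside the application definition.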

In addition, separation of concerns lets platform architects easily encapsulate the platform’s operational capabilities into reusable components, so that application developers can focus on integrating those components with their code and build trusted applications quickly and easily. The goal of OAM is to make simple application management easier and complex application delivery more manageable. Sun Jianbo said that the team will next focus on gradually promoting this system to cloud ISVs and software distributors, so that K8s-based application management truly becomes the mainstream of the cloud era.

Guest introduction: Sun Jianbo is an Alibaba technical expert and a member of the Kubernetes project community. He currently works on the delivery and management of large-scale cloud native applications at Alibaba. In 2015 he co-authored the technical book Docker Container and Container Cloud. He previously worked at Qiniu, where he took part in moving applications such as its time-series database, stream computing, and log platform to the cloud.

At the ArchSummit Global Architect Summit in Beijing on December 6-7 this year, Sun Jianbo will share more of the experience and lessons from Alibaba’s Kubernetes application management practice. He will introduce Alibaba’s existing practices in decoupling R&D and operations, the problems in those practices, the standardized, unified solution being implemented, and further thinking about the community.
