This article is based on the transcript of Sun Jianbo's talk at ArchSummit 2019 Beijing. It first introduces the problems Alibaba ran into while running applications at scale on Kubernetes, then walks through the existing practices for solving those problems and their limitations, and finally presents Alibaba's ongoing work and the direction the community is taking in this area.
Today, Alibaba maintains dozens of large-scale K8s clusters internally, the largest of which has about 10,000 nodes, and each cluster serves tens of thousands of applications. On Aliyun's Kubernetes service, ACK, we also maintain K8s clusters for tens of thousands of users. Once we had solved the scale and stability issues to a reasonable degree, we found that managing applications on K8s was still a big challenge.

Two challenges to application management

Today we are going to focus on these two challenges:

  • For application developers, the K8s API is overly complex for simple applications and hard to get started with for complex applications;
  • For application operators, the many extension capabilities of K8s are hard to manage, and the native K8s APIs do not fully cover cloud resources.
Overall, our challenge is how to provide a true application management platform on top of K8s, so that developers and operators can focus only on the application itself.

Application management for R&D

The K8s “all-in-one” YAML file

Let’s take a look at a K8s YAML file. The one here has already been simplified, but it is still quite long.





Faced with a YAML file like this, which is so often criticized as “complex,” it is tempting to think about how to simplify it.

From top to bottom, the file can be roughly divided into three parts:

  • The first part contains the parameters related to scaling out, scaling in, and rolling upgrades, which are mainly the concern of application operators;
  • The middle part covers the image, ports, and startup parameters, which are mainly the concern of developers;
  • The last part you may not understand at all, and in most cases you don't need to; it is what the K8s platform team cares about.
Looking at a YAML file like this, it is easy to conclude that you just need to wrap these fields and expose only the ones that matter. Indeed, we have internal PaaS platforms that do exactly this.

Exposing only a few fields: simple but not enough

Some internal PaaS platforms carefully selected a handful of fields and built a polished front-end around them, exposing only about five fields to users, which greatly reduces the mental burden of understanding K8s. Under the hood, the platform then renders those five fields into a complete YAML file in a template-like fashion. The exposed fields look something like this:
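Roughly, and with purely hypothetical field names rather than the actual schema of any internal PaaS, the user input and what the template expands it into might be:

# What the user fills in on the PaaS form (hypothetical field names):
appName: my-web
image: registry.example.com/my-web:v1.2.0
replicas: 3
containerPort: 8080
cpu: "1"

# The platform renders these values into a complete Deployment + Service
# YAML, filling in labels, selectors, probes, update strategy, and so on.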





It has to be said that this approach is very effective. For simple stateless applications, a streamlined API greatly lowers the barrier to K8s and lets users onboard quickly and efficiently, and the PaaS platform has been widely adopted. From other companies' technical talks, I have learned that many of them simplify the K8s API in a similar way.

However, once users start onboarding their business at scale, we inevitably run into complex stateful applications, and users begin to complain that the PaaS platform is not capable enough. For example, the leader-election and failover logic of a multi-instance Zookeeper cluster simply cannot be expressed within those five fields.

The essential problem is that hiding a large number of fields limits how the infrastructure's own capabilities can evolve, while the capabilities of K8s are in fact extremely powerful and flexible. We are not willing to give up those capabilities for the sake of simplicity.

As in the current example, it is natural to think that complex stateful applications should be handled with a CRD and an Operator in K8s.
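For instance, a minimal CustomResourceDefinition for a hypothetical Zookeeper-style application might be sketched as follows; a real Operator would then watch these objects and reconcile leader election, failover, backups, and so on:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: zookeeperclusters.example.com      # hypothetical group and kind
spec:
  group: example.com
  names:
    kind: ZookeeperCluster
    plural: zookeeperclusters
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                version:
                  type: string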

CRD+Operator: K8s is powerful but difficult to use

Sure enough, when we worked with internal teams to bring complex applications to cloud native, we recommended that they write Operators, but the following conversation came up again and again.





A middleware engineer would ask us: I have a Zookeeper cluster, which K8s Workload should I use to run it? We believed K8s was so well designed that there was no problem it could not solve, so we recommended an Operator. The response was: you cloud-native people have been inventing new terms for years, and we have never even heard of them.

If you think about it, it is not hard for the business teams to understand these new concepts, but actually implementing them well on their own is still very difficult. Naturally, we also felt that business teams should focus on their own business, so in the end we had to write the Operators for them.

As can be seen, we urgently need a unified model to address the application management demands of R&D.

Application management requirements of O&M

In addition to the problems on the R&D side, we also encountered great challenges on the operations side.

Operation and maintenance capabilities are numerous but difficult to manage

The CRD + Operator mechanism of K8s is very flexible and powerful. Not only can complex applications be implemented by writing a CRD and an Operator, but our operational capabilities can also be greatly extended through Operators, for example grayscale releases, traffic management, and auto scaling. We often admire how flexible K8s is and how easy it makes it for our platform team to deliver new capabilities, but it is hard for application operators to actually use all the capabilities we provide.

For example, we launched a CronHPA that can, on a schedule, adjust the instance range used by CPU-based scaling for different periods of the day. Application operators did not realize it would conflict with the native, non-scheduled HPA, and we had no unified channel for managing so many complex scaling capabilities, which naturally led to failures.
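To make the conflict concrete: a scheduled scaler and the native HPA can both claim the same workload and keep overwriting each other's replica count. The CronHPA shape below is purely illustrative, not the actual internal CRD:

# Hypothetical CronHPA: scale web-app between 10 and 20 replicas during the day
apiVersion: autoscaling.example.com/v1alpha1
kind: CronHPA
metadata:
  name: web-app-cron
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  schedules:
    - cron: "0 9 * * *"
      minReplicas: 10
      maxReplicas: 20
---
# Native HPA targeting the same Deployment on CPU: the two controllers fight
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70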

This painful lesson reminded us that such changes need to be validated before they take effect, and anyone familiar with K8s mechanisms will naturally think of adding an admission webhook for each Operator. The admission webhook needs to know all the operational capabilities bound to an application, as well as how the application itself runs, and then perform unified validation. That is fine if all of these capabilities come from a single provider; but once there are two, or even three, providers of extension capabilities, there is no unified way to know about them all. Thinking further, what we really need is a unified model to manage and reconcile these complex extension capabilities.
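A minimal sketch of the per-Operator webhook approach, assuming a hypothetical validation service that knows about every scaling-related API in the cluster, might be registered like this:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: scaling-conflict-checker            # hypothetical
webhooks:
  - name: check.scaling.example.com
    clientConfig:
      service:
        name: scaling-checker               # hypothetical service that rejects
        namespace: platform                 # conflicting scaling bindings
        path: /validate
    rules:
      - apiGroups: ["autoscaling", "autoscaling.example.com"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["horizontalpodautoscalers", "cronhpas"]
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail

The catch, as noted above, is that such a webhook only helps if it can actually see every extension capability in one place, which is exactly what a unified model would provide.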

Cloud resources are difficult to describe and deliver uniformly

Once the application Operator and the corresponding operational capabilities are written, it seems easy to package the application so that it can be delivered in a unified way to both public and private clouds. The dominant approach in the community today is to package applications with Helm, and we took this approach to deliver to our users, only to find that what they need goes beyond that.

A big feature of cloud native applications is that they tend to rely on resources on the cloud, including databases, networks, load balancers, and caches.





When we package with Helm, we can only target the native K8s API; if we also want to provision an RDS database, things get much harder. If you do not want to click through the database console and would rather manage it through the K8s API, you have to define a CRD for it and then call the actual cloud resource API from an Operator.
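In practice that means defining something like the following; the names are illustrative, and the Operator behind it would call the cloud provider's RDS API:

# A hypothetical custom resource that could ship inside the application package;
# an Operator watches it and provisions a real RDS instance.
apiVersion: database.example.com/v1alpha1
kind: RDSInstance
metadata:
  name: my-app-db
spec:
  engine: mysql
  version: "8.0"
  instanceClass: rds.mysql.s2.large
  storageGB: 100
  # the Operator writes the generated connection info back into this Secret
  connectionSecretName: my-app-db-conn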

This set of deliverables is essentially a complete description of an application, what we call an “application definition.” But in fact, we found that a standard “application definition” was missing in the cloud-native community. That is why a number of teams within Alibaba have tried to design their own application definitions.





An application definition like this ends up stuffing every piece of configuration into one file, which has the same problem as the all-in-one K8s API, or a worse one. In addition, these application definitions end up as black boxes: apart from the project they were built for, other systems can hardly reuse them, and collaboration across multiple teams is even harder.

Each company and team is defining its own application

It is not just Alibaba's internal teams that need application definitions. In fact, almost every company and team managing applications on K8s is defining its own. Here are two examples I found from other companies:





Application definition is actually an integral part of application delivery and distribution. However, in practice, we found that these in-house application definitions all face the following problems:

  1. Is the definition open enough to reuse?
  2. How to collaborate with the open source ecosystem?
  3. How to iterate and evolve?
All three of these challenges are significant. As mentioned above, an application definition needs to be easy to learn, flexible enough, and not a black box. It also needs to integrate closely with the open source ecosystem; without an ecosystem, an application definition has no future and will naturally struggle to keep iterating and evolving.

A layered model that separates user roles, with modular encapsulation

Let's step back and look at the challenges we face. The root cause is that the all-in-one K8s API was designed for platform builders; we cannot ask application developers and operators to face the same API that the K8s team faces, as shown on the left.





A reasonable application model should have a layered structure that distinguishes user roles and encapsulates operational capabilities in a modular way, so that different roles use different APIs, as shown on the right.

OAM: an application-centric, layered model of the K8s API

OAM (Open Application Model) is exactly such an application-centric, layered model of the K8s API:

  • From the developer's perspective, the API object they operate on and care about is called a Component;
  • From the operator's perspective, modular operational capabilities are encapsulated as Traits; operators can freely combine Components and Traits through an ApplicationConfiguration, which is finally instantiated into a running application;
  • The K8s team itself keeps iterating on the capabilities beneath this layer, based on the native K8s API.




For the K8s API that developers so often complained was too complex, we solve the problem by separating concerns and distinguishing the APIs each role faces. At the same time, we provide several core Workload types so that developers only need to fill in a few fields to define a component. For complex application definitions, Workloads can be extended so that developers can plug in a CRD and Operator.

The OAM model uses Traits to meet operators' needs for modular packaging of operational capabilities and for global management. A Trait is the embodiment of an operational capability; different Traits correspond to different kinds of capabilities, such as log collection, load balancing, and horizontal scaling. At the same time, OAM itself provides a standard for global management: an implementation of the OAM model can easily manage and validate the various Trait descriptions that appear in OAM definitions.

OAM also provides a unified API for cloud resources, again divided into three categories by concern (a rough sketch follows the list):

  • Cloud resources that developers care about, such as an RDS database or OSS object storage, are accessed through extended Workloads;
  • Cloud resources that operators care about, such as an SLB load balancer, are attached through Traits;
  • The last category is also an O&M concern but may involve relationships across multiple applications, such as a virtual private cloud (VPC); these are accessed through an application Scope. Scope is the OAM concept for managing relationships that span multiple applications.
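A rough sketch of how the three categories might appear in OAM terms; all of the names here are illustrative rather than actual definitions:

# Developer-facing cloud resource: a Component whose workloadType is an extended Workload
spec:
  workloadType: alibaba.oam.dev/v1alpha1.RDS     # illustrative extended workload name
  workloadSettings:
    - name: engine
      value: mysql

# Operator-facing cloud resource: a load balancer attached as a Trait
# inside the ApplicationConfiguration
traits:
  - name: slb-ingress                            # illustrative trait name
    properties:
      port: 80

# A cross-application cloud resource such as a VPC is modeled as an application
# Scope that several applications can join (shape omitted here).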
As you can see, OAM addresses all of the challenges we have discussed through one unified set of standards. Let's now dive into OAM and look at each of these concepts.

OAM Component: the API developers care about

Component is the API object that the OAM model provides for developers, as shown below:

apiVersion: core.oam.dev/v1alpha1
kind: Component
metadata:
  name: nginx
  annotations:
    version: v1.0.0
    description: >
      Sample component schematic that describes the administrative interface for our nginx deployment.
spec:
  workloadType: Server
  osType: linux
  containers:
    - name: nginx
      image:
        name: nginx:1.7.9
        digest: "sha256:..."
      env:
        - name: initReplicas
          value: 3
        - name: worker_connections
          fromParam: connections
  workloadSettings:
    ...
  parameters:
    - name: connections
      description: "The setting for worker connections"
      type: number
      default: 1024
      required: false
Component itself is a K8s CRD, and its spec field is what defines the component. The first important field in the spec is workloadType, which determines how the application runs.

For simple applications, OAM provides six types of core Workload, as shown in the following table:





They are mainly distinguished by whether the workload is reachable over the network, whether it is replicable, and whether it is long-running. For example, Server corresponds to the most common Deployment + Service combination in K8s.
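Roughly, per the v1alpha1 spec, the six core types differ as follows:

  • Server: long-running, replicable, and reachable over the network;
  • Singleton Server: like Server, but limited to a single instance;
  • Worker: long-running and replicable, but with no network endpoint;
  • Singleton Worker: a single-instance Worker;
  • Task: replicable, but runs to completion instead of running continuously;
  • Singleton Task: a single run-to-completion instance.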

A Component that uses one of the core workloadTypes only needs to fill in the image, startup parameters, and a few other fields under containers; just like the field-hiding PaaS described at the beginning, this greatly lowers the barrier for users.

For complex stateful applications, OAM allows the Workload to be extended. As shown in the figure below, we can define a new WorkloadType called OpenFaaS, whose definition is essentially equivalent to a CRD definition.





In the OAM model, a custom Workload is also used through a Component; the only difference is that workloadType is set to the name of your custom WorkloadType.
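A minimal sketch, in which the WorkloadType name and the settings are illustrative:

apiVersion: core.oam.dev/v1alpha1
kind: Component
metadata:
  name: my-function
spec:
  # fully qualified name of the custom WorkloadType instead of a core type
  workloadType: openfaas.com/v1alpha1.Function
  workloadSettings:
    - name: image                     # settings interpreted by the extended workload
      value: registry.example.com/my-function:v1
    - name: handler
      value: ./handler.py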

OAM Trait: Discoverable, manageable operations and maintenance capabilities

Traits are modular O&M capabilities, and we can use the command-line tool to discover which Traits a system supports:

$ kubectl get traits
NAME            AGE
autoscaler      19m
ingress         19m
manual-scaler   19m
volume-mounter  19m
From there, it is very simple for operators to check how a specific capability should be used:

$ kubectl get trait cron-scaler -o yaml
apiVersion: core.oam.dev/v1alpha1
kind: Trait
metadata:
  name: cron-scaler
  annotations:
    version: v1.0.0
    description: "Allow cron scale a workloads that allow multiple replicas."
spec:
  appliesTo:
    - core.oam.dev/v1alpha1.Server
  properties: |
    {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "object",
      "required": [
        "schedule"
      ],
      "properties": {
        "schedule": {
          "type": "array",
          "description": "CRON expression for a scaler",
          "item": {
            "type": "string"
          }
        },
        "timezone": {
          "type": "string",
          "description": "Time zone for this cron scaler."
        },
        "resource": {
          "type": "object",
          "description": "Resources the cron scaler will follow",
          "properties": {
            "cpu": {
              "type": "object",
              ...
            }
          }
        }
      }
    }
As you can see, the Trait definition makes it clear which Workload types this capability applies to, which parameters can be filled in, which of them are required or optional, and what each parameter means. You may also notice that in OAM, APIs such as Component and Trait are all schemas, so they expose the complete set of an object's fields and are the best way to understand what that object can do.

In fact, you may have noticed that traits are defined as CRD equivalents, and that you can implement traits through operators as well.





So OAM, through its separation of roles, in effect brings order to what used to be a scattered pile of Operators.

OAM ApplicationConfiguration: assembling Components and Traits into a running application

Components and Traits are finally combined through an ApplicationConfiguration, which is what actually gets instantiated and run.





More importantly, an OAM application description is completely self-contained: through the OAM YAML, a software distributor can fully track all the resources and dependencies the software needs to run. As a result, a single OAM configuration file is enough to bring the application up quickly in any environment, at any time, delivering a self-contained application description to any runtime environment.
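A minimal ApplicationConfiguration, in the spirit of the v1alpha1 examples (exact field names may differ slightly between spec revisions), looks roughly like this:

apiVersion: core.oam.dev/v1alpha1
kind: ApplicationConfiguration
metadata:
  name: my-app
spec:
  components:
    - componentName: nginx            # the Component defined by the developer earlier
      instanceName: web-front-end
      parameterValues:
        - name: connections           # overrides the Component's parameter
          value: "4096"
      traits:
        - name: manual-scaler         # operators attach O&M capabilities here
          properties:
            replicaCount: 3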

The Rudr project shown in the figure is one implementation of OAM, or an interpreter of OAM: it translates the unified OAM description into the many Operators behind it. Rudr is also a unified point of control: if a Component is bound to two Traits that conflict with each other in the ApplicationConfiguration, the conflict can be caught quickly, as shown in the figure below.





Similarly, the orchestration of complex applications, the provisioning of cloud resources, the interaction between Workloads and Traits, and more can all be implemented inside the OAM interpreter.

You can experience these OAM interactions through the tutorial documentation in the Rudr project.

Kubernetes PaaS supported by OAM

In fact, a PaaS backed by OAM is built on Kubernetes and manages its many Operators in layers.





For developers, the application they care about might be a combination of a web application and a database: behind the database component is an RDS Operator, and behind the web application might be OpenKruise, our open-source project that enhances the native K8s StatefulSet. The extra capabilities OpenKruise provides, such as in-place upgrade, are configured through Traits. Additional monitoring capabilities such as alerting and log collection are implemented by separate Operators, and these are what operators focus on and manage at the second layer.

Finally, the K8s team, together with the providers of various foundational software, keeps delivering extension capabilities as Operators around the native K8s API, and exposes them externally in a standardized way through unified specifications and standards such as OAM.

More importantly, OAM's unified description greatly improves the reusability of Operators, so that writing an Operator becomes mainly about the business logic itself. For example, when you previously wrote a Zookeeper Operator, you had to implement instance service discovery, the orchestration logic for leader/follower switchover during upgrades, and the logic for backing up instances; with OAM standardization, you can easily find similar components in the community and reuse them.

A Kubernetes PaaS backed by OAM can assemble different Operators as flexibly as Lego bricks, turning the application definition into something the community builds together and making application management unified yet even more powerful!

In closing

Finally, I would like to share the near-term plans for the OAM project. OAM is an application definition model owned entirely by the community, and we very much hope you will get involved.








This article is original content from Alibaba Cloud and may not be reproduced without permission.