After Wang Yuan, Vice President of NetEase, Executive President of the NetEase Hangzhou Research Institute, Chairman of the Internet Technology Committee, and General Manager of NetEase Shufan, delivered the keynote speech "Building an Open Cloud Native Operating System and System Software Architecture" at the ArchSummit Global Architect Summit 2021 in Shanghai, Zhang Xiaolong, member of the NetEase Technical Committee and Director of Infrastructure at NetEase Shufan, further explained NetEase Shufan's thinking, implementation, and experience with cloud native middleware to the participants. This is a transcript of the lecture.

Today, I would like to share our practice of containerizing middleware for production, in four parts:

The first part starts from the operation and maintenance (O&M) challenges of basic middleware and introduces NetEase's technical evolution in solving them, as well as the reasons for containerizing middleware.

The second part introduces the requirements of middleware containerization and the overall architecture of the NetEase Shufan platform.

The third part gives our thinking and best practices on common problems encountered in the process of middleware containerization.

Finally, I summarize the middleware containerization work and our plans for the future.

Basic middleware challenges

Before the advent of container technology, basic middleware such as MySQL, Redis, and Kafka had long been open source and had become standard components of server-side architecture design. For a typical Internet application, the three major middleware categories, database, cache, and message queue, are all essential.

Using these middleware to construct application platforms is simple for architects, but operations personnel run into major problems in five areas:

  1. Middleware itself is a complex distributed system. O&M staff need to understand how these distributed systems work and write appropriate O&M scripts, which is very complex.
  2. O&M efficiency is low. Manual operation may be fine for fewer than 50 MySQL instances, but with 500 or 1,000 database instances, or thousands of Redis instances as at NetEase Cloud Music, operating by hand-written scripts is very inefficient.
  3. Stability is lacking, because operators work with hand-run scripts and commands copied into online terminals; one carelessly copied wrong command can bring middleware down.
  4. Traditional middleware is deployed on physical machines, and physical machines cannot provide strong resource elasticity.
  5. Senior middleware O&M engineers are concentrated in Internet companies, precisely because the work is so complex, so ordinary enterprises find it hard to recruit professional middleware O&M staff. We believe the best practice for solving these challenges is to turn middleware O&M capabilities into cloud services.

Turning middleware into cloud services has several advantages. First, operation becomes simple and easy to use. Second, large fleets of instances can be operated automatically and efficiently. Third, the SLA guarantee is stronger, because far fewer manual commands are typed. Fourth, IaaS elasticity enables rapid scaling. Finally, because everything becomes simpler, businesses no longer need a large number of professionals to operate their middleware.

In fact, public cloud vendors have seen the same trend: the three major public clouds in China have all turned open source basic middleware into cloud services. I think there are two main reasons. First, competition at the IaaS resource level tends toward homogenization, and turning PaaS middleware into cloud services consumes more resources and binds users more deeply. Second, as value-added services on the cloud, middleware has a much higher gross margin than cloud hosts and cloud disks, which is also why many public cloud users avoid RDS and instead buy cloud hosts to run MySQL themselves.

To address the complexity of middleware O&M, NetEase developed an IaaS-based middleware platform six or seven years ago. The platform has some notable technical features. First, it provides resource elasticity based on IaaS: the computing resources used by the middleware are cloud hosts, the storage resources are cloud disks, and the network resources may sit in tenants' VPCs.

Second, it adopts the tenant isolation strategy of IaaS. When a tenant requests middleware instances, the platform automatically sets them up on that tenant's own cloud hosts and cloud disks, achieving good isolation between tenants.

We developed six basic middleware cloud services. A business team that needed middleware for its product simply accessed these cloud services instead of building everything from scratch. Our work focused on the control and management side shown on the left, such as instance high availability, deployment and installation, and instance management. We achieved some success at the time, greatly improving the operations team's ability to operate middleware.

Over time, however, the first generation of basic middleware exposed three major flaws that were difficult to address. The first is the performance ceiling. Because it uses KVM virtual machines as computing resources, the performance loss is much greater than running on physical servers, and it cannot meet the harsh performance and stability requirements middleware faces under high load.

The second is that resource costs are too high. The platform depends on IaaS for resource scheduling, and KVM's strong isolation means memory cannot be shared between middleware instances. Together these make the deployment density of middleware instances on virtual machines very low: even when a tenant's middleware load is light, its memory cannot be freed for other instances, because KVM isolation is strict.

Third, its delivery is not flexible. It is bound to NetEase's IaaS, so we could not commercialize it and deliver it to enterprises outside NetEase, whose infrastructure may be on a public cloud or in their own IDC.

Thinking about middleware containerization

In recent years, container technologies such as Docker and Kubernetes have been born and developed rapidly, and the containerization of stateless applications has matured. We believe that containers, as a new and widely adopted infrastructure technology, are a perfect match for the defects of the first-generation basic middleware: weak isolation enables resource sharing; lightweight virtualization removes most of the performance loss and meets the requirements of heavy-load scenarios; standardized image-based packaging enables efficient delivery; scheduling is powerful and flexible; and most importantly, containers are the cornerstone of the entire cloud native stack.

The key to Kubernetes orchestration is that it is loosely coupled to the infrastructure, letting us move applications anywhere, because it is designed for the hybrid cloud. It is also designed for massive production environments, inheriting Google's experience with them, so there is good reason to hope container technology can solve the problem of turning middleware into services.

NetEase internally built a cloud native operating system based on Kubernetes, which adapts downward to all kinds of infrastructure resources and serves upward as a unified provider for all kinds of application workloads (which is also one of Kubernetes' goals). Middleware is exactly the type of workload the entire cloud native operating system is supposed to support, so from this point of view middleware containerization also makes sense.

For middleware containerization to solve the O&M problems, the following requirements in particular must be considered.

First, lifecycle management: we need the containerized middleware platform to help O&M staff complete all kinds of operations at the middleware instance level. NetEase Shufan implements this based on the Kubernetes Operator framework.

The second point is highly available deployment. Middleware that pursues high availability often needs to be deployed across machine rooms, with all instances of a cluster distributed among the rooms in a given proportion. The standard Kubernetes scheduler could not do this, so we extended it to implement such placement, as sketched below.
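The talk does not detail the extension itself; as a rough illustration of the intent, upstream Kubernetes now expresses a similar policy with topology spread constraints. A minimal sketch in Go, assuming a hypothetical `app: redis-cluster-1` label and one zone per machine room:

```go
package scheduling

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// redisSpread keeps all Pods of one middleware cluster evenly distributed
// across machine rooms (zones), allowing at most one Pod of skew.
func redisSpread() corev1.TopologySpreadConstraint {
	return corev1.TopologySpreadConstraint{
		MaxSkew:           1,
		TopologyKey:       "topology.kubernetes.io/zone", // one zone per machine room
		WhenUnsatisfiable: corev1.DoNotSchedule,          // hard requirement
		LabelSelector: &metav1.LabelSelector{
			// Hypothetical label identifying this cluster's Pods.
			MatchLabels: map[string]string{"app": "redis-cluster-1"},
		},
	}
}
```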

At the same time, metrics for monitoring and alerting need to be improved, which corresponds to the cloud native observability system built around Prometheus.

Performance was a pain point of the first-generation middleware. We need the performance of containerized middleware to essentially reach that of physical machine deployment so it can support core applications, which requires targeted optimization for each kind of middleware instance.

Finally, there is delivery. We want containerized middleware not only for use inside NetEase but also as a commercial product delivered externally. So, referring to the product shape of RDS and Redis services on the public clouds, we need the same product capabilities and the ability to deliver flexibly and at low cost on any infrastructure, which forces us to adopt a loosely coupled, highly reusable architecture design.

NetEase Shufan chose the Kubernetes Operator mechanism. Looked at more deeply, Kubernetes provides the "primitives" required for deploying and operating distributed systems. Its built-in objects such as Pod, Node, Deployment, and StatefulSet were all proposed for a typical stateless distributed system, and together they make the deployment and operation of stateless applications very efficient.

But these built-in objects do not directly solve the problem of deploying and operating middleware. First, middleware is stateful: its state lives in storage, and perhaps also in its network IP. Second, unlike the replicas of a stateless application, which have nothing to do with one another, middleware replicas are related: instances access each other and form complex topologies, such as the master-slave relationship between two Redis replicas during a failover.

The community also began tackling middleware, or stateful applications, more than two years ago, and came up with the Operator development framework. If we understand Kubernetes as an operating system, then Operator is a framework for developing native applications on top of that operating system, supporting a more efficient, automated, and extensible development approach.

An Operator has four characteristics. It needs to be developed; it follows a declarative programming philosophy; it consists of object definitions plus a controller deployment. An Operator is a controller that follows a closed decision loop of observe, analyze, act: it continuously compares the current state of the resources a user has defined against their declared target state.

As the figure shows, the current state has one Pod running version 0.0.1, while the declared state requires version 0.0.2 and one more Pod. When the Operator finds this inconsistency, it takes action: it scales out another Pod and upgrades both to 0.0.2. When we implement an Operator, we are essentially writing what those actions should do. This encapsulates domain-specific operations knowledge and experience, and can be designed to manage complex stateful applications.
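In controller-runtime terms, the observe-analyze-act loop becomes a Reconcile method. A minimal sketch in Go, assuming a hypothetical RedisCluster custom resource (NetEase Shufan's own operators are not public, so this only illustrates the pattern):

```go
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// RedisCluster is a hypothetical custom resource, trimmed to the fields
// needed for the sketch. Real projects generate this with operator-sdk.
type RedisCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              RedisClusterSpec `json:"spec"`
}

type RedisClusterSpec struct {
	Replicas int32  `json:"replicas"`
	Version  string `json:"version"`
}

// DeepCopyObject satisfies runtime.Object; a shallow copy suffices here.
func (in *RedisCluster) DeepCopyObject() runtime.Object {
	out := *in
	return &out
}

// RedisClusterReconciler implements the observe-analyze-act loop.
type RedisClusterReconciler struct {
	client.Client
}

func (r *RedisClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Observe: fetch the declared target state.
	var cluster RedisCluster
	if err := r.Get(ctx, req.NamespacedName, &cluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Analyze: list the cluster's Pods and diff them against
	// cluster.Spec.Replicas and cluster.Spec.Version.

	// Act: create missing Pods, upgrade outdated ones, update status;
	// returning without error lets the loop converge step by step.
	return ctrl.Result{}, nil
}
```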

The main body of the Operator development framework includes three parts. The first is operator-sdk, a scaffold for development. The second is Operator Lifecycle Manager, a lifecycle management component. The third is OperatorHub.io: since anyone can develop an Operator that deploys, installs, and operates an application, there should be a marketplace to publish it in, and OperatorHub.io is such a marketplace.

Operators developed by different organizations differ in maturity. From the O&M point of view there is a maturity model: the lowest level is basic installation, that is, re-implementing the original installation and deployment scripts in the Operator engineering mode, while at the highest level application operations are fully automated, which is the level O&M desires most.

This is the middleware platform architecture NetEase Shufan implemented based on Kubernetes Operators, consisting of a control plane and a data plane. On the left, the control side carries the management and O&M capabilities, including common components that are unrelated to any particular middleware but needed by everyone, such as auditing, authentication and authorization, and the console.

In the middle are the middleware Operators, where we use the Operator mechanism to develop middleware such as Redis, Kafka, and MySQL.

These implement the lifecycle management of the middleware. The Operators themselves also run on Kubernetes, and an Operator is a stateless application that can run in Deployment mode, because its state is stored in etcd.

Next is the Kubernetes management plane, the components required on the Master nodes.

At the bottom are the logging, monitoring, and alerting components. We developed a log management platform that covers everything from log collection to dynamically updating the collection configuration.

On the right is the middleware data plane, drawn here as three nodes. We use StatefulSets to realize middleware clusters: each instance runs in a Pod, and each Pod may claim persistent volumes. Instances need to synchronize data and topology with one another for state changes and fault recovery. Each node runs two Kubernetes components, kubelet and kube-proxy, plus a collector for logs and monitoring.

We also implemented Pod disk attachment through StorageClass, whether for local disks or remote disks, which is also the Kubernetes standard; a sketch follows.
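As a rough illustration of this shape, here is a minimal Go sketch of a StatefulSet whose volumeClaimTemplates give each Pod its own PVC through a chosen StorageClass. Names, image, and sizes are illustrative, and the PVC Resources field type assumes a client-go release contemporary with the talk (pre-v0.29):

```go
package deploy

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// redisStatefulSet declares a three-instance cluster in which every Pod
// gets its own PVC from the given StorageClass via volumeClaimTemplates.
func redisStatefulSet(storageClass string) *appsv1.StatefulSet {
	replicas := int32(3)
	labels := map[string]string{"app": "redis"} // hypothetical labels
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "redis"},
		Spec: appsv1.StatefulSetSpec{
			Replicas:    &replicas,
			ServiceName: "redis", // headless Service, assumed to exist
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "redis", Image: "redis:6.2"}},
				},
			},
			// One PVC per Pod; the StorageClass decides local vs. remote disk.
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{Name: "data"},
				Spec: corev1.PersistentVolumeClaimSpec{
					AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
					StorageClassName: &storageClass,
					Resources: corev1.ResourceRequirements{
						Requests: corev1.ResourceList{
							corev1.ResourceStorage: resource.MustParse("20Gi"),
						},
					},
				},
			}},
		},
	}
}
```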

Common problems and solutions of middleware containerization

Next, let's discuss some common problems in the process of middleware containerization. The biggest characteristic of middleware is that it is stateful, while Kubernetes is only responsible for orchestrating computation. There are two possibilities for storing middleware state: remote storage and local storage.

We consider remote storage the best practice. If your private cloud environment has a reliable distributed storage system similar to open source Ceph, you should not hesitate to use it. If Ceph's performance is not good enough, find a better distributed storage and use it directly. On the public cloud, you should not hesitate to use cloud disks as middleware storage.

In many cases, local storage is a last resort: there may be no reliable distributed storage; the distributed storage may perform far worse than local disks; or the backend reliability of the distributed system may be poor enough to lose data.

To this end, we implemented local storage access, with two requirements. One is dynamic management of local disks when a Pod applies for a PVC, performing the corresponding operations on creation and deletion. The other is strong binding between Pod scheduling and the local disk: once a Pod has been created on a Node, it must still run on that Node after failure recovery or restart, to ensure the middleware data is not lost.

In the technical implementation, we introduced LVM for dynamic management of local disks on nodes, and we also adopted Kubernetes Local PV. The drawback of the latter is that it requires operators to create PVs on nodes in advance, which is undesirable. So we did two things. One is a scheduler extension that prepares local storage resources: when a Pod is created declaring the local disk size it needs, the disk can be dynamically created and mounted into the Pod, with no manual preparation by operators.

The figure shows the scheduling of such a Pod. The user creates a Pod that declares a PVC. Our local-storage scheduler extension first pre-schedules, calculating whether each Node has enough local disk capacity; if so, it writes the Node information into the PVC. A local storage resource preparer on that Node is then notified; on receiving the request it calls LVM to create the storage resource and the corresponding PV, binds the PV to the PVC, and informs the scheduler that the Pod can be scheduled to this node because its declared local storage is ready. Finally, Kubernetes mounts the node's local disk into the Pod, completing the overall scheduling.
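Upstream Kubernetes expresses the "schedule first, then bind the volume" part of this flow with a StorageClass in WaitForFirstConsumer mode, which delays volume binding until the scheduler has picked a node. A minimal Go sketch, with a hypothetical provisioner name standing in for the LVM-based preparer:

```go
package storage

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// localLVMClass sketches a StorageClass for dynamically provisioned local
// volumes. The provisioner name is hypothetical; the key point is the
// binding mode, which mirrors the pre-scheduling step described above.
func localLVMClass() *storagev1.StorageClass {
	mode := storagev1.VolumeBindingWaitForFirstConsumer
	return &storagev1.StorageClass{
		ObjectMeta:        metav1.ObjectMeta{Name: "local-lvm"},
		Provisioner:       "example.com/local-lvm", // hypothetical LVM provisioner
		VolumeBindingMode: &mode,
	}
}
```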

There are two scenarios for middleware container networking. In the first, the middleware we designed runs on different infrastructures, which correspond to different network configurations. On a physical network, schemes such as Calico and Flannel can be used directly through their CNIs. On a public cloud, we connect to the cloud's VPC network; conveniently, each public cloud provides a standard Kubernetes CNI, so Kubernetes running on cloud hosts can access its network.

In the second scenario, we need to optimize network performance, so we introduced a container SR-IOV scheme, whose advantage is latency even lower than ordinary physical machine networking. It is implemented with NIC passthrough technology, which cuts latency by 50% by removing the virtualization overhead of network transmission, and can meet the needs of ultra-high-performance tasks that demand low latency, although it cannot improve PPS. Its disadvantages are also obvious: because it depends entirely on hardware NICs, the scheme can only be used on physical networks and cannot accelerate networking on public clouds.

To deal with heterogeneous NICs in a physical network environment, including Intel and Mellanox NICs, the VFs (Virtual Functions, an SR-IOV concept) need to be carefully managed. We treat VFs as extended scheduling resources, using the standard Kubernetes Device Plugin to discover and register each node's VF resources; combined with labels and taints, the native scheduler can then manage and allocate them.
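Once a device plugin has registered VFs as an extended resource, a middleware Pod requests one like any other resource. A minimal Go sketch, with a hypothetical resource name (real device plugins register their own):

```go
package network

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// vfRequest sketches how a middleware container asks the scheduler for
// one SR-IOV VF. "example.com/sriov-vf" is a hypothetical resource name.
func vfRequest() corev1.ResourceRequirements {
	vf := resource.MustParse("1")
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{"example.com/sriov-vf": vf},
		Limits:   corev1.ResourceList{"example.com/sriov-vf": vf},
	}
}
```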

Each instance is a Pod of a StatefulSet. Across rolling updates, or when a Pod is deleted and then rebuilt, a StatefulSet keeps only the Pod's name unchanged; it cannot keep the Pod's IP unchanged. In the eyes of traditional middleware operators, however, IPs in physical machine deployments never change: after a machine restarts it keeps its original IP. So their operational habits prefer IPs over domain names.

To roll out containerized middleware faster while accommodating existing applications, we built a sticky-IP feature for StatefulSets by introducing a global container address pool component that takes over Pod IP allocation. When a StatefulSet is created, we record the IPs assigned to it and keep the records even when Pods are deleted during updates. When a Pod is rebuilt with the same name, it is assigned its original IP again.
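The core of such an address pool is a mapping from the stable Pod name to its recorded IP. A minimal Go sketch of the idea (an illustration of the mechanism described above, not NetEase Shufan's implementation):

```go
package ipam

import "sync"

// Pool keeps IP assignments keyed by the stable StatefulSet Pod name
// ("namespace/pod-name"), so records survive Pod deletion and rebuild.
type Pool struct {
	mu       sync.Mutex
	assigned map[string]string // e.g. "prod/redis-0" -> "10.0.3.17"
	free     []string          // unallocated addresses
}

// IPFor returns the IP previously recorded for the Pod name, or allocates
// a new one. A rebuilt Pod with the same name gets its old IP back.
func (p *Pool) IPFor(podKey string) (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if ip, ok := p.assigned[podKey]; ok {
		return ip, true
	}
	if len(p.free) == 0 {
		return "", false // pool exhausted
	}
	ip := p.free[0]
	p.free = p.free[1:]
	p.assigned[podKey] = ip
	return ip, true
}
```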

In engineering terms, compared with the first generation of virtualization-based middleware, developing containerized middleware reuses Kubernetes' built-in concepts and some of its O&M and control mechanisms, so the cost of developing the same basic middleware drops greatly. The code is much smaller than that of the first generation, but this comes at a cost: developers must know the Kubernetes Operator framework very well, and must deeply understand Kubernetes' declarative programming philosophy, to write it.

For quality assurance, we did two things. The first is chaos testing, that is, fault testing: based on the open source ChaosBlade, we simulate the impact of Kubernetes resource failures on middleware services. We also use the Kubernetes E2E testing framework to simulate the normal lifecycle operations that operations personnel perform on the various middleware instances.

In many cases, the consoles of different middleware have much in common, and their UIs are used the same way. This is a front-end page rendering engine we designed: with a dynamic form mechanism, consoles can be developed quickly, and much of a console's business capability can be realized through back-end configuration alone, which keeps development costs small.

Performance optimization. We adopted several strategies to make the performance of containerized middleware nearly as good as on a physical machine. On the CPU, we turned on performance mode to reduce wake-up latency. On memory, we turned off swap and transparent huge pages and tuned the threshold for synchronous writeback of dirty pages, all at the kernel parameter level.

On I/O, we enabled the kernel's blk-mq and increased the read-ahead cache. NIC interrupts are also important: we isolate the CPUs that handle physical NIC interrupts and container veth interrupts from those running the workloads, so that system performance does not jitter; an example follows.
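As a concrete but purely illustrative picture of this parameter-level work, here is a Go sketch of a node agent writing a few of the knobs mentioned above to their standard procfs/sysfs locations. The values are examples, not NetEase Shufan's exact settings, and a real agent would iterate over all CPUs and the actual data disks:

```go
package tuning

import "os"

// Apply writes a handful of node-level tuning knobs. Paths are standard
// Linux procfs/sysfs locations; values are illustrative only.
func Apply() error {
	knobs := map[string]string{
		// Prefer the "performance" cpufreq governor (shown for cpu0 only).
		"/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor": "performance",
		// Disable transparent huge pages.
		"/sys/kernel/mm/transparent_hugepage/enabled": "never",
		// Start background dirty-page writeback earlier.
		"/proc/sys/vm/dirty_background_ratio": "5",
		// Avoid swapping middleware memory.
		"/proc/sys/vm/swappiness": "0",
		// Larger read-ahead for the data disk (example device).
		"/sys/block/sda/queue/read_ahead_kb": "4096",
	}
	for path, value := range knobs {
		if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
			return err
		}
	}
	return nil
}
```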

NUMA is also one of our optimizations, and its effect is particularly evident under high load. We made container deployment aware of the NUMA topology, allocating Pods to local NUMA nodes as much as possible and trying not to let a single Pod span NUMA nodes, to avoid the high cost of cross-NUMA CPU cache traffic.
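Upstream kubelet can approximate this behavior with the static CPU manager plus the topology manager's single-numa-node policy, which refuses placements that would split a Pod's resources across NUMA nodes. A minimal Go sketch of the relevant KubeletConfiguration fields (this illustrates the standard mechanism, not NetEase Shufan's internal implementation):

```go
package tuning

import kubeletv1beta1 "k8s.io/kubelet/config/v1beta1"

// numaAwareKubelet pins exclusive CPUs to Guaranteed Pods and keeps each
// Pod's resources within a single NUMA node.
func numaAwareKubelet() kubeletv1beta1.KubeletConfiguration {
	return kubeletv1beta1.KubeletConfiguration{
		CPUManagerPolicy:      "static",
		TopologyManagerPolicy: "single-numa-node",
	}
}
```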

One drawback of the first generation of middleware was that it could not be delivered externally. Last year we turned containerized middleware into a product, called Light Boat Middleware, which has the standard capabilities of basic middleware. At the access layer we also added some capabilities: because it is built on Kubernetes, operations personnel can even operate middleware through kubectl and YAML files. At the middleware service layer, we implemented seven basic middleware services, which essentially have the core O&M capabilities mentioned above.

On the whole, the Operator-based middleware can run in any Kubernetes cluster, regardless of the underlying resources: public cloud virtual machines can serve as Kubernetes Nodes, and cloud disks can serve as Kubernetes storage. In addition, we also allow middleware developed by the community on the Operator model to run on our platform.

Future

Technology serves the business. The biggest pain point of middleware is O&M, which should be solved by managed cloud services, and the advantages of container technology make containerization the best practice for realizing middleware cloud services. Realizing it requires Operators and a more cloud-native mode of developing containerized middleware; of course, the demands on developers are also very high.

First, our current containerized middleware platform can run on any Kubernetes, and we will also run on Kubernetes distributions such as OpenShift and Rancher. We expect the containerized middleware Operators to run on them as well, though some compatibility work is required. Second, we want to build a cloud native operating system in which middleware is just one of the workloads: why not mix middleware workloads with stateless application workloads? That would bring the company higher resource utilization and reduce costs.

Thank you!