Background knowledge recommended for this article:

  1. The fundamentals of Kubernetes and the responsibilities of each component;
  2. The basic concepts of Serverless computing and its advantages;
  3. A plus: basic understanding of the community Knative project.

This article is based on a talk given by Yitao Dong and Ke Wang at KubeCon NA 2019.

Yitao Dong is a product manager at Ant Financial. He drives cloud computing products, including the cloud native PaaS platform, container, and Serverless products, and works closely with end customers to help them put cloud native solutions into production in large-scale financial scenarios.

Ke Wang is a software engineer at Ant Financial. He builds enterprise Serverless products on Kubernetes/Knative, is an early Knative user, a Kubernetes community member, and an early maintainer of control plane flow control, and has long been committed to optimizing and implementing cloud native technology in innovative ways.

Talk summary

Knative is a Google-led Serverless platform built on Kubernetes and enjoys a strong reputation in the community. However, as a community project, Knative is primarily concerned with standards and architecture. Although its concepts are advanced, it is still far from production-ready.

In this KubeCon presentation, Yin Xiu and Zhong Le from Ant Financial's SOFAStack PaaS product technology team share Ant Financial's practice and adaptation of Knative: building an excellent Serverless computing platform on top of Knative, and analyzing in detail how unique techniques solve the three major problems of performance, capacity, and cost.

The talk starts from Serverless computing application scenarios and real customer use cases across public cloud, private cloud, industry cloud, and more, describing the many uses of Serverless computing. It then introduces a solution for running the Knative platform on Kubernetes, detailing the issues that must be overcome to make it production-ready. Finally, it solves these problems one by one to build a Knative platform better than the community version.

II. Solving the performance problem: pre-warming with a Pod pool

Those of you familiar with Kubernetes probably know that performance is not the primary goal of Kubernetes.

In a large Kubernetes cluster, creating a new Pod and getting it running is slow, because the whole chain is long: a POST request is sent to the API Server; the Scheduler receives the event that a new Pod resource has been created, runs its filtering and scoring algorithms over all nodes, and writes the scheduling result back to the API Server; the Kubelet on the selected node receives the event, pulls the Docker image, and starts the container; finally, the new container passes its security checks and registers with the Service Mesh.

Any of these steps can suffer delays, lost events, or failures (running out of schedulable resources is especially common). Even when everything works properly, end-to-end latency of up to 20 seconds is not uncommon in a large Kubernetes cluster.

This leads to the dilemma of Serverless on Kubernetes: one of the defining features of Serverless is automatic scaling, especially scaling from 0 to 1 so that no resources are occupied when idle. But if a user runs a website or backend on Serverless and the first request takes 20 seconds to succeed, that is unacceptable.

To solve the cold start performance problem, our team came up with a creative solution: a Pod Pool.

Our product pre-creates a large number of Pods and keeps them running. When a Kubernetes controller wants to create a new Pod, instead of creating one from scratch, we find an eligible standby Pod, inject the application code into it, and use it directly.

In the talk we shared some implementation details, such as how to create a CRD and fork Kubernetes' ControllerManager to implement the new workload at low cost; how to automatically adjust the pool's watermark based on historical usage data; and how to inject code. For injection we proposed three approaches: instructing the container process to download and execute the code package, using an Ephemeral Container, and modifying the Kubelet to allow a container to be replaced.
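To make the claim flow concrete, here is a minimal Go sketch (not the actual Ant Financial implementation) of how a pool controller might claim a pre-warmed Pod: list standby Pods by label, pick one whose pre-created size matches the request, and mark it as claimed. The `pool.example.com/state` label and the matching logic are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// claimWarmPod finds a running, unclaimed Pod in the warm pool whose requests
// match the desired size, and labels it as claimed so no other loop grabs it.
func claimWarmPod(ctx context.Context, cs kubernetes.Interface, ns, cpu, mem string) (*corev1.Pod, error) {
	pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "pool.example.com/state=standby",
	})
	if err != nil {
		return nil, err
	}
	for i := range pods.Items {
		p := &pods.Items[i]
		if p.Status.Phase != corev1.PodRunning {
			continue
		}
		req := p.Spec.Containers[0].Resources.Requests
		// Reuse the Pod only if its pre-created size matches what the app asked for.
		if req.Cpu().String() != cpu || req.Memory().String() != mem {
			continue
		}
		// Mark the Pod as claimed (optimistic update; a conflict means someone else took it).
		p.Labels["pool.example.com/state"] = "claimed"
		return cs.CoreV1().Pods(ns).Update(ctx, p, metav1.UpdateOptions{})
	}
	return nil, fmt.Errorf("no matching warm pod; fall back to creating a new one")
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	if _, err := claimWarmPod(context.Background(), cs, "warm-pool", "500m", "512Mi"); err != nil {
		fmt.Println(err)
	}
}
```

A real controller would of course do this through informer caches and its reconcile loop rather than ad-hoc List calls, but the claim-by-label idea is the same.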

The actual implementation is more complex, with more issues to consider, such as how to handle different resource requests and limits across Pods. We also implemented a small scheduler of our own: when no pre-warmed Pod can satisfy a request, it checks the remaining resource allowance on the Pod's node, and if the headroom is sufficient, it dynamically updates the Kubernetes control plane data and the cgroups to achieve "vertical expansion" in place.
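A minimal sketch, under our own assumptions, of the headroom check behind this "vertical expansion": sum the requests of the Pods already on the node and see whether the remaining allocatable resources can absorb the extra CPU and memory the warm Pod needs. The inputs are assumed to come from informer caches; names are illustrative.

```go
package warmpool

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// hasHeadroom reports whether node can accommodate growing a warm Pod by extraCPU/extraMem.
func hasHeadroom(node *corev1.Node, podsOnNode []*corev1.Pod, extraCPU, extraMem resource.Quantity) bool {
	usedCPU := resource.Quantity{}
	usedMem := resource.Quantity{}
	for _, p := range podsOnNode {
		for _, c := range p.Spec.Containers {
			usedCPU.Add(*c.Resources.Requests.Cpu())
			usedMem.Add(*c.Resources.Requests.Memory())
		}
	}
	freeCPU := node.Status.Allocatable.Cpu().DeepCopy()
	freeCPU.Sub(usedCPU)
	freeMem := node.Status.Allocatable.Memory().DeepCopy()
	freeMem.Sub(usedMem)
	// Only if the node still has enough headroom do we patch the Pod's resources
	// and the corresponding cgroups in place ("vertical expansion").
	return freeCPU.Cmp(extraCPU) >= 0 && freeMem.Cmp(extraMem) >= 0
}
```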

In practice this cold-start optimization proved very effective. With the Pod size fixed and the code package cached, the start time of the simplest HTTP-server application dropped from nearly 20 seconds to about 2 seconds, and 0-to-1 stability improved greatly because the Pod no longer needs to be scheduled on the spot.

This optimization skips several API Server interactions, the Pod scheduling process, and the Service Mesh registration process, greatly improving the 0-to-1 user experience without much extra cost. In general, reserving 10 to 20 extra Pods handles most cases; for the rare short bursts of application traffic, the worst case is simply falling back to the original path of creating new Pods.

Pre-warmed Pod pools are useful not only for cold-start optimization but for many other scenarios. In the talk I called for standardizing this technique to address Kubernetes' data plane performance issues, and afterwards an audience member suggested that CNCF wg-serverless might be interested in taking it on.

III. Reducing cost: sharing control plane components

On the cost side, we shared our multi-tenant transformation and other ways to reduce costs.

Running the community version of Knative in a single-tenant manner is expensive: a Kubernetes control plane and a Service Mesh control plane must be deployed because Knative depends on them, and the Knative control plane itself is resource-hungry. Tens of gigabytes of machine resources are consumed before delivering any business value. It is therefore necessary to share these control plane components.

With sharing, users no longer pay for the infrastructure on their own: the cost of the control plane depends only on the total number of applications created across tenants, not on the number of tenants.

We recommend two sharing approaches. The first is namespace isolation plus RBAC permission control, which is the simplest control plane sharing method, natively supported by Kubernetes and widely used. The second is the multi-tenant Kubernetes scheme developed by Ant Financial's fintech team: by adding one level of directory in etcd and storing each user's data under their own directory, a truly full-fledged multi-tenant Kubernetes can be achieved.
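As a minimal sketch of the first approach, the Go snippet below grants a tenant's user edit rights only inside that tenant's namespace, so tenants share one control plane but cannot see each other's resources. The "tenant-a" namespace, the "tenant-a-admin" user, and the "knative-app-editor" ClusterRole are illustrative assumptions, not names from the talk.

```go
package main

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	binding := &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "tenant-a-editor", Namespace: "tenant-a"},
		Subjects: []rbacv1.Subject{{
			Kind: rbacv1.UserKind, Name: "tenant-a-admin", APIGroup: rbacv1.GroupName,
		}},
		RoleRef: rbacv1.RoleRef{
			Kind: "ClusterRole", Name: "knative-app-editor", APIGroup: rbacv1.GroupName,
		},
	}
	// Because the RoleBinding is namespaced, the granted permissions stop at the namespace boundary.
	if _, err := cs.RbacV1().RoleBindings("tenant-a").Create(context.Background(), binding, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```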

We also covered other ways to cut costs: for example, using Virtual Kubelet to connect to Alibaba Cloud ECI (an on-demand container service), using Cluster Autoscaler to automatically release under-utilized Kubernetes nodes or to buy Alibaba Cloud ECS instances as new nodes for horizontal expansion, and so on. We also briefly touched on the security issues that can arise when containers from multiple tenants share the same host, such as Docker escapes; one possible solution is Kata Containers (lightweight virtual machines), which avoid sharing the Linux kernel.

IV. Solving the capacity problem: supporting sharding at every level

The capacity challenge is that as workloads grow, the Knative controllers and data plane components, the Kubernetes control plane itself, and the Service Mesh all come under greater pressure. This is not hard to solve as long as sharding is supported at every level, from top to bottom.

The upstream system assigns a shard ID to each application, and downstream we deploy multiple sets of control plane components, each set handling one shard ID.
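A minimal sketch of one possible way the upstream system could assign shard IDs: hash the application name into a fixed number of shards and stamp the result as a label that every downstream component filters on. The shard count and label scheme are assumptions for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numShards = 4

// shardID deterministically maps an application name to one of numShards shards.
func shardID(appName string) string {
	h := fnv.New32a()
	h.Write([]byte(appName))
	return fmt.Sprintf("shard-%d", h.Sum32()%numShards)
}

func main() {
	// All resources belonging to "payment-frontend" would then carry a label
	// such as shard.example.com/id=shard-2 on every Kubernetes object.
	fmt.Println(shardID("payment-frontend"))
}
```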

To fully support sharding, we need to transform each controller on the control plane, the Knative Activator on the data plane, and the Service Mesh.

Modifying the controllers is very easy: add a LabelSelector carrying the shard ID to each Informer, and the controller will only see the resources under that shard ID, automatically ignoring everything else. By giving each set of controllers non-overlapping LabelSelectors, we can run multiple sets of controllers side by side without interference. Because controller reconciliation is stateless and idempotent, we can still deploy multiple replicas per shard ID in an active-active manner for high availability.
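A minimal sketch of this controller-side change with client-go: build the informer factory with a tweaked list option so this controller instance only ever lists and watches resources labeled with its own shard ID. The label key `shard.example.com/id` and the environment variable used to pass the shard ID are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	shard := os.Getenv("SHARD_ID") // e.g. "shard-2", set per controller deployment
	factory := informers.NewSharedInformerFactoryWithOptions(cs, 30*time.Second,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			// Every List/Watch issued by these informers is restricted to this shard.
			opts.LabelSelector = "shard.example.com/id=" + shard
		}))

	podInformer := factory.Core().V1().Pods().Informer()
	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	fmt.Println("cache synced; objects visible to this shard:", len(podInformer.GetStore().List()))
}
```

The same tweak applies to the informers for CRDs generated by code-generator, so a set of controllers deployed per shard stays blind to every other shard's resources.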

The next step is transforming the data plane Activator. Its main challenge is finding the AutoScaler that corresponds to each application, because the AutoScalers are also sharded and deployed in multiple copies. Here we use domain-name-based addressing, encoding the shard ID as part of the domain name; then, via DNS records or the Service Mesh, traffic from the Activator can be routed to the AutoScaler of the right shard.
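A minimal sketch of this shard-aware addressing: encode the shard ID into the AutoScaler's hostname so ordinary DNS (or the mesh's routing rules) delivers the Activator's traffic to the right AutoScaler. The naming scheme below is purely illustrative.

```go
package main

import "fmt"

// autoscalerHost returns the host the Activator should contact for a given shard.
func autoscalerHost(shardID, namespace string) string {
	// e.g. "autoscaler-shard-2.knative-serving.svc.cluster.local",
	// assuming one Service per shard of AutoScaler replicas.
	return fmt.Sprintf("autoscaler-%s.%s.svc.cluster.local", shardID, namespace)
}

func main() {
	fmt.Println(autoscalerHost("shard-2", "knative-serving"))
}
```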

By default, each sidecar in the Service Mesh holds information about all other Pods, so in a mesh with N Pods each sidecar's data volume is O(N). Using ServiceGroup, we can divide one Service Mesh into multiple sub-meshes and make sidecars visible only to each other within the same sub-mesh, which solves the problem of data volume growing rapidly with scale. Naturally, each sub-mesh needs its own Ingress, but this has an upside: the pressure on each Ingress stays low. When a service in one sub-mesh needs to reach a service in another sub-mesh, it can go through the other sub-mesh's Ingress via its public IP address. With all of this in place, a single Kubernetes cluster can scale almost without limit.

However, when the workload reaches a certain scale, the Kubernetes control plane itself may become the bottleneck. At that point we can deploy another Kubernetes cluster and move some shards into it, treating the cluster as an even higher level of sharding. Fittingly, one of the hottest topics at this year's KubeCon was multi-cluster: packaging up an entire application, spinning up a new Kubernetes cluster and deploying it in full, with unified operations and upgrades…

V. Concluding remarks

The talk lasted less than 40 minutes, drew about 150 attendees on site and 653 online followers, and has had nearly 200 views on YouTube so far. KubeCon is a very good technology conference with many professionals in attendance. Even better is its atmosphere of inclusiveness, freedom and sharing, which makes you feel part of a big community, progressing and innovating together and doing something great together.

At this KubeCon we shared some of Ant Financial's technical achievements and also got to see the new technologies and products the community is working on, many of which are excellent and well worth learning from.

Our cloud native Serverless platform still faces many scenarios and problems to solve, and the data scale keeps growing. We welcome anyone dedicated to the cloud native field to join us and explore and innovate together!

Video replay of the talk: www.youtube.com/watch?v=PA1…

P.S. Our team is always hiring and internal transfers are welcome: product, engineering, and testing roles, based at Alipay S Space in Shanghai. Join us!

Financial-Grade Distributed Architecture (Antfin_SOFA)