Background

As Kubernetes has become the de facto standard for cloud native, more and more traditional services are being migrated from virtual machines and physical machines to Kubernetes. Cloud providers, including Tencent's internal self-developed cloud, also encourage businesses to deploy services on Kubernetes and enjoy the benefits it brings: elastic scaling, high availability, automatic scheduling, multi-platform support, and so on. However, most Kubernetes-based deployments today are stateless. Why is it harder to containerize stateful services than stateless ones? What are the difficulties, and what are the corresponding solutions?

Drawing on my understanding of Kubernetes and my experience developing, operating, and containerizing stateful services, this article analyzes the difficulties of containerizing stateful services and the corresponding solutions. I hope it helps you understand these pain points, choose solutions flexibly for your own stateful service scenarios, and run stateful services on Kubernetes efficiently and stably, improving development and operation efficiency and product competitiveness.

Challenges of stateful service containerization

In order to simplify the problem and avoid excessive abstraction, I will take the commonly used Redis cluster as a concrete case to explain in detail how to containerize a Redis cluster, and use this case to analyze and extend to the common problems in stateful service scenarios.

Below is the overall architecture diagram of the Redis cluster solution Codis (referenced from the Codis project).

Codis is a proxy-based distributed Redis cluster solution consisting of the following core components:

  • ZooKeeper or etcd is the stateful metadata store; it is deployed on an odd number of nodes
  • codis-proxy is a stateless component. It calculates the CRC16 hash of the key and forwards the request to the corresponding back-end codis-group based on the preshard routing table saved in ZooKeeper/etcd
  • codis-group consists of a group of master and standby Redis nodes, one master and multiple standbys, responsible for data reads, writes, and storage
  • codis-dashboard is the cluster control-plane API service, through which you can add and delete nodes and migrate data
  • redis-sentinel is the high-availability component of the cluster. It detects and monitors the liveness of the Redis master and initiates a master/standby switchover when the master fails

So how can we containerize the Codis cluster on Kubernetes, so that it can be created by operating resources with kubectl and managed efficiently?

When containerizing stateful services like Codis, we need to resolve the following issues:

  • How do you describe your stateful service in Kubernetes?
  • How do you choose the appropriate workload to deploy your stateful service?
  • When the built-in Kubernetes workloads cannot directly describe the business scenario, which Kubernetes extension mechanism should be selected?
  • How do I make safe changes to stateful services?
  • How do I ensure that the Pod instances of my stateful service are scheduled to different fault domains?
  • How can stateful service instances self-heal after failures?
  • How do I meet the high network performance requirements of containerized stateful services?
  • How do I meet the high storage performance requirements of containerized stateful services?
  • How do I verify the stability of a stateful service after containerization?

I have systematically sorted the technical difficulties of containerizing stateful services into a mind map. Next, I will walk through the containerization solutions from each of the above aspects.

Workload types

The first problem is how to describe your stateful service using Kubernetes' APIs and vocabulary.

Kubernetes abstracts various business scenarios in the complex software world, with built-in Workload types such as Pod, Deployment, and StatefulSet. What are the usage scenarios for each Workload?

Pod, the smallest unit of scheduling and deployment, consists of a set of containers that share the network, data volumes, and other resources. Why is Pod designed as a set of containers instead of a single one? In real, complex business scenarios, a single business container often cannot complete some functions on its own. For example, you may want an auxiliary container to download cold-backup snapshot files or forward logs for you. Thanks to the design of Pod, auxiliary containers can share the same network namespace and data volumes with your stateful container, such as Redis, MySQL, etcd, or ZooKeeper, and help the primary business container do this work. Such an auxiliary container is called a sidecar in Kubernetes. It is widely used in auxiliary scenarios such as log forwarding and service mesh, and has become a Kubernetes design pattern. The excellent design of Pod is distilled from more than 10 years of Borg operation experience inside Google, and it can significantly reduce the cost of containerizing complex businesses.
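
As a minimal sketch of such a Pod, a Redis container plus a log-forwarding sidecar share one data volume (names, images, and paths below are illustrative assumptions, not taken from the Codis project):

apiVersion: v1
kind: Pod
metadata:
  name: redis-with-sidecar        # illustrative name
spec:
  containers:
  - name: redis                   # primary business container
    image: redis:6.2              # illustrative image
    volumeMounts:
    - name: redis-logs
      mountPath: /data/logs
  - name: log-forwarder           # sidecar sharing the same network namespace and data volume
    image: fluent/fluent-bit:1.8  # illustrative image
    volumeMounts:
    - name: redis-logs
      mountPath: /data/logs
      readOnly: true
  volumes:
  - name: redis-logs
    emptyDir: {}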

The business process can be containerized successfully through Pod. However, Pod itself does not provide features such as high availability, automatic scaling, and rolling updates. Therefore, Kubernetes provides the higher-level workload Deployment to solve these challenges. It is designed to describe stateless service scenarios, so it is particularly suitable for the stateless components in the stateful cluster we discussed above, such as the codis-proxy component of the Codis cluster.
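
For a stateless component like codis-proxy, a plain Deployment is enough; a minimal sketch (the image tag and port below are illustrative assumptions) could look like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: codis-proxy               # illustrative name
spec:
  replicas: 3                     # scale the stateless proxy horizontally
  selector:
    matchLabels:
      app: codis-proxy
  template:
    metadata:
      labels:
        app: codis-proxy
    spec:
      containers:
      - name: codis-proxy
        image: codis/codis-proxy:latest   # illustrative image
        ports:
        - containerPort: 19000            # illustrative proxy port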

So why are stateful services not suitable for Deployment? The main reasons are that the Pod names generated by a Deployment are variable, there is no stable network identity, there is no stable persistent storage, and there is no control over ordering during rolling updates, all of which matter greatly for stateful services. On the one hand, communicating through stable network identities is a basic requirement for high availability and data reliability in stateful services. For example, in etcd, a log entry must be confirmed and persisted by more than half of the nodes in the cluster before it is committed; in Redis, the master/slave replication relationship is established according to stable network identities. On the other hand, whether it is etcd or components such as Redis, when a Pod is rebuilt after a failure, the business usually expects its corresponding persistent data not to be lost.

To address the above pain points of stateful service scenarios, Kubernetes designed and implemented StatefulSet to describe them, providing each Pod with a unique name, a fixed network identity, persistent data storage, and an ordered rolling update and release mechanism. With StatefulSet, you can easily containerize stateful services such as etcd and ZooKeeper.
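
As a trimmed-down sketch (the image tag and storage class name are illustrative assumptions), a three-replica etcd StatefulSet shows the stable per-Pod identity and per-Pod persistent storage that a Deployment lacks:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd-headless       # a headless Service gives stable DNS names etcd-0, etcd-1, etcd-2
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.4.13   # illustrative image
        volumeMounts:
        - name: data
          mountPath: /var/lib/etcd
  volumeClaimTemplates:            # each Pod gets its own PVC: data-etcd-0, data-etcd-1, ...
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: cbs        # illustrative storage class
      resources:
        requests:
          storage: 10Gi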

Through Deployment and StatefulSet, we can quickly containerize services in most real business scenarios. However, business demands are diverse, and technology stacks and scenarios differ. Some want fixed Pod IPs so they can quickly connect to traditional load balancers; some want Pods to be updated in place without being rebuilt during a release; some want to update an arbitrary, specified StatefulSet Pod.

Extension mechanism

Kubernetes is designed with a powerful extension architecture, as shown below (quoted from the Kubernetes blog): from kubectl plugins, the Aggregated API Server, CRDs, custom schedulers, and operators, to network plug-ins (CNI) and storage plug-ins (CSI), everything can be extended. This fully empowers businesses, so that each business can customize on top of Kubernetes' extension mechanisms to meet the demands of its specific scenario.

CRD and Aggregated API Server

When Deployment and StatefulSet cannot meet your needs, Kubernetes provides mechanisms such as CRD, the Aggregated API Server, and operators to extend API resources, so that you can combine your domain and application knowledge to automate resource management and O&M tasks.

CRD stands for CustomResourceDefinition and is Kubernetes' built-in way to extend resources. The kube-apiextension-server is integrated inside the apiserver, so no additional apiserver needs to run in the Kubernetes cluster. It implements the general Kubernetes API operations such as CRUD and Watch for your custom resources, and supports kubectl, authentication, authorization, and auditing. However, it does not support custom subresources such as log/exec, and it does not support custom storage: custom resources are stored in the Kubernetes cluster's own etcd, so storing a large number of CRD resources will affect the performance of the cluster etcd, and it also limits the ability to migrate the service from cluster to cluster.
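
As a sketch, a minimal CustomResourceDefinition registering a custom CodisCluster resource (the group and the permissive schema below are illustrative assumptions) looks like this:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: codisclusters.example.tencent.com   # must be <plural>.<group>; illustrative group
spec:
  group: example.tencent.com
  scope: Namespaced
  names:
    plural: codisclusters
    singular: codiscluster
    kind: CodisCluster
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true   # permissive schema, for illustration only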

The Aggregated API Server (aggregated apiserver), used for example by metrics-server, splits the huge monolithic apiserver into multiple aggregated apiservers by resource category, further strengthening extensibility. Adding a new API no longer relies on modifying Kubernetes code: developers write their own apiserver and deploy it in the Kubernetes cluster, and register the group name of the custom resource and the apiserver's service name with the cluster through an APIService resource. When the Kubernetes apiserver receives a request for the custom resource, it forwards the request to the custom apiserver based on the APIService resource information. This approach supports kubectl, configurable authentication, authorization, and auditing, custom third-party etcd storage, and custom development of other advanced features, such as the log/exec subresources.
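
The registration itself is an APIService object; for example, the one used by metrics-server looks roughly like this (taken as a typical deployment, details may vary by version):

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io      # <version>.<group>
spec:
  group: metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  service:                          # requests for this group are proxied to this in-cluster Service
    name: metrics-server
    namespace: kube-system
  insecureSkipTLSVerify: true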

Overall, CRD provides a simple way to create and store extended resources without programming, while the Aggregated API Server provides a mechanism that gives you finer control over API behavior, such as custom storage and use of the protobuf protocol.

Enhanced Workload

To meet the requirements of advanced features such as in-place update and updating a designated Pod, Tencent provides corresponding solutions internally and in the community: StatefulSetPlus (not open source) and TKEStack's TAPP (open source), both tested in large-scale production environments. In the community, Alibaba open-sourced the OpenKruise project, and PingCAP also has an advanced-statefulset project to address the problem of updating a designated StatefulSet Pod.

StatefulSetPlus was designed to meet the needs of a large number of traditional Tencent businesses running on Kubernetes. On top of being compatible with all StatefulSet features, it supports in-place container upgrades, integrates with TKE's IPAMD component to realize fixed IPs, supports HPA, supports automatic Pod drift and self-healing when a Node becomes unavailable, supports manual batch upgrades, and more.

OpenKruise includes a series of enhanced Kubernetes controller components, including CloneSet, Advanced StatefulSet, SidecarSet, etc. CloneSet is a workload focused on the pain points of stateless services, supporting in-place updates, specified Pod deletion, and Partition control during rolling updates. Advanced StatefulSet is an enhanced version of StatefulSet that supports in-place updates, pausing, and maximum unavailability.

With these enhanced workload components, your stateful service gains the advantages of the traditional virtual machine and physical machine deployment mode, such as in-place updates and fixed IPs. At this point, should you containerize your service directly on top of StatefulSetPlus, TAPP, and the like, or should you define a custom resource based on Kubernetes' extension mechanisms to describe the components of your stateful service, and write a custom operator on top of workloads such as StatefulSetPlus and TAPP?

The former is suitable for simple stateful service scenarios: few components, easy to manage, no Kubernetes programming knowledge required, and no development needed. The latter is suitable for more complex scenarios and requires you to understand the Kubernetes programming model, know how to define custom resources, and write controllers. By combining your stateful service domain knowledge, you can write a very powerful controller on top of enhanced workloads such as StatefulSetPlus and TAPP, which creates and manages a complex, multi-component stateful service with one click and provides features such as high availability and automatic scaling.

Operator-based extension

In the Codis cluster example above, you can use Kubernetes' CRD extension mechanism to define a custom resource that describes a complete Codis cluster, as shown in the figure below.
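
For example, a CodisCluster custom resource describing the whole cluster might look like the following sketch (all spec fields are illustrative assumptions, matching the hypothetical CRD sketch above, not the actual Codis project):

apiVersion: example.tencent.com/v1alpha1   # illustrative group/version
kind: CodisCluster
metadata:
  name: codis-test
spec:
  etcd:
    replicas: 3            # metadata store
  proxy:
    replicas: 3            # stateless codis-proxy instances
  redis:
    groups: 4              # number of codis-groups (one master plus standbys each)
    replicasPerGroup: 2
  dashboard:
    replicas: 1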

Having declaratively described your stateful business object through a CRD, we also need to implement the business logic through Kubernetes' operator mechanism. The core principle of a Kubernetes operator is the controller pattern: it obtains and watches the desired state and the actual state of the business object from the API server, compares the difference between them, and performs reconciliation so that the actual state converges to the desired state.

The core of how it works is shown in the figure above (quoted from the community).

  • Through the List operation of the Reflector component, get the initial state data (CRD objects, etc.) from kube-apiserver.
  • Obtain the ResourceVersion of the resources from the List response, then use the Watch mechanism with that ResourceVersion to monitor data changes after the List in real time.
  • Received events are added to the Delta FIFO queue for processing by the Informer component.
  • The Informer forwards the events in the Delta FIFO queue to the Indexer component, which stores the objects in a local cache.
  • Operator developers can register callbacks for Add, Update, and Delete events through the Informer component. When the Informer receives an event, it calls back into the business function. For example, in a typical controller scenario, each event is added to a WorkQueue; the operator's reconciliation goroutine retrieves the item from the queue, parses the key, and reads the data from the local cache maintained by the Informer mechanism using that key.
  • For example, when you receive an event that a Codis CRD object was created and find that no Deployment/TAPP or other components related to the object are running, you create the proxy service through the Deployment API and the Redis service through the TAPP API, and so on.

Scheduling

After solving how to describe and host your stateful service with Kubernetes' built-in workloads and its extension mechanisms, the next problem is how to ensure that the "equivalent" Pods of a stateful service are deployed across fault domains to guarantee high availability.

First, how should "equivalent" Pods be understood? In Codis and TDSQL clusters, a group of Redis/MySQL master and standby instances is responsible for processing the requests of the same data shard, and high availability is achieved through these master and standby instances. Since the master and standby instance Pods are responsible for the same data shard, they are called equivalent Pods, and production environments expect them to be deployed across fault domains.

Second, how should the fault domain be understood? A fault domain indicates the potential impact scope of a failure and can be divided into host level, rack level, switch level, and availability-zone level. A group of Redis master and standby instances should at least achieve host-level high availability: if the node hosting the master instance of any shard fails, the standby instance should be automatically promoted to master, and every shard of the Redis cluster should still be able to provide service. Similarly, in a TDSQL cluster, a set of MySQL instances should at least implement switch-level and availability-zone-level disaster recovery to ensure that the core storage service is highly available.

So how do you implement the cross-fault-domain deployment of equivalent Pods described above?

The answer is scheduling. The built-in Kubernetes scheduler automatically assigns Pods to the best nodes according to your Pods' resource requests and scheduling policies, and it also provides powerful scheduling extension mechanisms that let you easily implement custom scheduling policies. In general, in simple stateful service scenarios, you can implement Pod deployment across fault domains based on the affinity and anti-affinity scheduling policies provided by Kubernetes.

Suppose you want to deploy a highly available three-node etcd cluster through containerization. The fault domain is the availability zone, and each etcd node must be placed on a node in a different availability zone. How do we implement this cross-availability-zone deployment based on the affinity and anti-affinity features provided by Kubernetes?

Affinity and anti-affinity

You can simply add the following anti-affinity configuration to the workload used to deploy etcd, declaring the Pod labels of the target etcd cluster, with the topology domain set to the node's availability zone, and using a hard (required) anti-affinity rule: if a node does not satisfy the rule, the Pod cannot be scheduled onto it.

How does the scheduler schedule when it encounters a Pod with an anti-affinity configuration added?

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: etcd_cluster
          operator: In
          values: ["etcd-test"]
      topologyKey: failure-domain.beta.kubernetes.io/zone

First, after the scheduler observes the pending Pods generated for the etcd workload, it uses the labels in the anti-affinity configuration to query the nodes and availability zones of the already scheduled Pods. Then, in the scheduler's filtering phase, candidate nodes whose availability zone matches that of an already scheduled Pod are eliminated. The nodes that enter the scoring phase are therefore all nodes that satisfy the constraint of deploying Pods across availability zones. According to the configured scoring policies, the scheduler selects an optimal node and binds the Pod to it, finally achieving cross-availability-zone Pod deployment and disaster recovery.

However, in complex scenarios such as Codis clusters and TDSQL distributed clusters, the Kubernetes scheduler itself may not meet your demands. It does, however, provide the following extension mechanisms to help you customize scheduling policies and satisfy scheduling demands in a variety of complex scenarios.

Custom scheduling policies, scheduler extenders, and more

First, you can modify the scheduler's filtering (predicates) and scoring (priorities) policies to configure scheduling policies that meet your business needs. For example, if you want to reduce cost and support all services in the cluster with the minimum number of nodes, you need to pack Pods preferentially onto nodes that meet their resource needs and already have the most resources allocated. In this scenario, you can modify the priorities configuration and enable the MostRequestedPriority policy with a higher weight.
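
As a sketch, the legacy scheduler policy file (passed to kube-scheduler via --policy-config-file, shown here as YAML; the predicate list is illustrative and the exact names depend on your Kubernetes version) could enable MostRequestedPriority like this:

kind: Policy
apiVersion: v1
predicates:
- name: PodFitsResources
- name: PodFitsHostPorts
priorities:
- name: MostRequestedPriority   # pack Pods onto the most-allocated nodes to minimize node count
  weight: 1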

Second, you can implement a scheduler extender based on the Kubernetes scheduler. During the scheduler's predicates and priorities phases, it calls back into your extended scheduling service to meet your scheduling needs. For example, if you want to implement cross-node disaster recovery for a group of MySQL or Redis master and standby instances responsible for the same data shard, you can implement your own predicates function to remove from the candidate list the nodes that already host the Pods of that shard, ensuring that the nodes entering the priorities phase all satisfy your business needs.
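
A scheduler extender is registered in the same legacy policy file; a sketch (the extender service URL is a hypothetical example) looks like this:

extenders:
- urlPrefix: "http://mysql-scheduler-extender.kube-system.svc/scheduler"   # hypothetical extender service
  filterVerb: filter            # the scheduler POSTs candidate nodes here during the predicates phase
  prioritizeVerb: prioritize    # the scheduler asks for node scores here during the priorities phase
  weight: 1
  nodeCacheCapable: false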

Third, you can implement your own separate scheduler based on the Kubernetes scheduler. After deploying the separate scheduler into the cluster, you simply declare the Pod's schedulerName as your scheduler's name.
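
Declaring the custom scheduler is then a one-line change in the Pod spec (the scheduler name and image below are illustrative assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: redis-0                           # illustrative name
spec:
  schedulerName: my-custom-scheduler      # defaults to "default-scheduler" if omitted
  containers:
  - name: redis
    image: redis:6.2                      # illustrative image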

Scheduler framework

Finally, Kubernetes introduced a new scheduling framework in version 1.15 that adds hooks before and after the core phases of the scheduler. For selecting the Pod to be scheduled, it supports custom queue-sorting algorithms; the filtering phase provides the PreFilter and Filter interfaces; the scoring phase adds the PreScore, Score, and NormalizeScore interfaces; and the binding phase provides the PreBind, Bind, and PostBind interfaces. Based on the new scheduling framework, businesses can control scheduling policies more finely and at lower cost, and customize scheduling policies more easily and efficiently.
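
With the scheduling framework, plugins are enabled per profile in the KubeSchedulerConfiguration; a sketch assuming a hypothetical out-of-tree plugin compiled into your scheduler binary (profile and plugin names are illustrative, and the config API version varies by Kubernetes release):

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: stateful-scheduler       # illustrative profile name, referenced by Pod schedulerName
  plugins:
    preFilter:
      enabled:
      - name: StatefulShardPlugin         # hypothetical plugin implementing cross-fault-domain rules
    score:
      enabled:
      - name: StatefulShardPlugin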

High availability

With the scheduling issues resolved, our stateful service can be deployed in a highly available manner. However, highly available deployment does not mean the service can serve traffic with high availability. After containerization, we may encounter more stability challenges than in traditional physical machine and virtual machine deployments. Stability challenges may come from the operator, Kubernetes itself, runtime components such as Docker/containerd, the Linux kernel, and so on. How do we deal with the stability problems caused by these factors?

We should treat Pod exceptions as a normal situation by design, and have a self-healing mechanism in place for any Pod exception in containerized scenarios. For stateless services, we only need to add reasonable liveness and readiness probes to the business Pods: when a Pod becomes abnormal it is rebuilt automatically, and when a node fails the Pods automatically drift to other nodes. However, in stateful service scenarios, even if the workload hosting the stateful service supports automatic Pod drift after node failure, it may still fail to meet business demands because of long Pod self-healing time and data safety concerns. Why?

Suppose that in a Codis cluster, a node hosting a Redis master suddenly "loses contact". If the cluster waits 5 minutes before entering the self-healing process, the service will be unavailable externally for 5 minutes, which is clearly unacceptable for important stateful services. And even if you shorten the self-healing time for lost nodes, you cannot guarantee data safety: if the cluster network is partitioned and the lost node is still serving traffic, a multi-master double-write situation will occur, which may eventually lead to data loss.

So what is a safe high-availability solution for stateful services? It depends on the high-availability mechanism of the stateful service itself; the Kubernetes container platform layer cannot provide a safe solution on its own. Common high-availability solutions for stateful services include master/standby replication, decentralized replication, and consensus algorithms such as Raft/Paxos. Below, I briefly describe their differences, advantages and disadvantages, and the points to note during containerization.

Master/standby replication

The Codis cluster and TDSQL cluster cases discussed above both achieve high availability through master/standby replication, which is simpler than decentralized replication and consensus algorithms. Master/standby replication can be further divided into fully synchronous replication, asynchronous replication, and semi-synchronous replication.

In fully synchronous replication, after receiving a write request, the master waits for all standby nodes to confirm before returning success to the client. Therefore, if one standby node fails, the whole system becomes unavailable. This scheme sacrifices availability to ensure consistency across all replicas.

In asynchronous replication, after receiving a write request, the master returns success to the client immediately and forwards the request to each replica asynchronously. If a failure occurs before the request is forwarded to the replicas, data may be lost, but availability is the highest.

Semi-synchronous replication sits between fully synchronous and asynchronous replication: after the master receives a write request, it returns success to the client once at least one replica has received the data, balancing data consistency and availability.

For stateful services based on master/standby replication, the business needs to implement and deploy an HA service that performs the master/standby switchover. By implementation architecture, HA services can be divided into the active-reporting type and the distributed-detection type. For the active-reporting type, take the MySQL master/standby switchover in TDSQL as an example: an agent is deployed on each MySQL node and reports heartbeats to the metadata storage cluster (ZooKeeper/etcd); if the master's heartbeat is lost, the HA service rapidly initiates a master/standby switchover. The distributed-detection type takes Redis Sentinel as an example: an odd number of sentinel nodes is deployed, and each sentinel node periodically probes the availability of the Redis master and standby instances; the sentinel nodes share their probe results with each other through a gossip protocol, and when the master fails, one of the sentinels initiates the master/standby switchover.

In general, when a node of a stateful service based on master/standby replication fails in the traditional deployment mode, operations staff must replace it manually. After containerization, an operator can automatically replace faulty nodes and perform fast vertical scaling, significantly reducing operational complexity. However, Pods may be rebuilt, so an HA service that performs the master/standby switchover should be deployed to improve availability. If the service is sensitive to data consistency, frequent switchovers may increase the probability of data loss; this can be mitigated by using dedicated nodes and stable, newer runtime and Kubernetes versions.

Decentralized replication

The opposite of master/standby replication is decentralized replication, which means that in a cluster of n replica nodes, any node can accept write requests, but a write succeeds only after w nodes confirm it, and a read must query at least r nodes. You can set appropriate w/r parameters based on how sensitive your business scenario is to data consistency. For example, if you want any client to always read the new value after each write, then with n = 3 replicas you can set both w and r to 2, so that among the two nodes you read, at least one must contain the most recently written value. This is called a quorum read/write (w + r > n).

Amazon's Dynamo system is based on decentralized replication. Its advantages are that the nodes are equal, operational complexity is reduced, availability is higher, and containerization is easier since no HA component needs to be deployed. Its drawback is that decentralized replication inevitably leads to various write conflicts, and the business needs to pay attention to conflict handling.

Consensus algorithm

To ensure service availability, most databases based on the replication algorithms above provide only eventual consistency. Both master/standby replication and decentralized replication have certain defects and cannot meet the demands of both strong data consistency and high availability.

How to solve the above replication algorithm dilemma?

The answer is the Raft/Paxos family of consensus algorithms, which were first proposed in the context of replicated state machines. A replicated state machine consists of a consensus module, a log module, and the state machine (as shown in the Raft paper). The consensus module ensures that the logs on all nodes are consistent; each node then executes instructions in the same order based on the same log, so that the results of every replicated state machine are consistent. Taking the Raft algorithm as an example, it consists of leader election, log replication, and safety. After the leader node fails, the follower nodes can quickly elect a new leader while guaranteeing data safety; after a follower node fails, as long as a majority of nodes survive, the overall availability of the cluster is not affected.

Typical stateful services based on consensus algorithms include etcd, ZooKeeper, and TiKV. In this architecture, the service itself integrates the leader election algorithm and the data synchronization mechanism, which makes operation and containerization significantly less complex than for master/standby-replication services, and makes containerization safer. Even if the leader node fails because of a bug hit during containerization, data safety and service availability are barely affected thanks to the consensus algorithm. Therefore, stateful services based on consensus algorithms are the recommended candidates for containerization.

High performance

Having achieved the goal of running stateful services stably on Kubernetes, the next goal is to pursue high performance. The high performance of stateful services depends on the underlying container network solution and the disk I/O solution. In the traditional physical machine and VM deployment modes, stateful services enjoy fixed IPs, a high-performance underlay network, and high-performance local SSDs. How can a stateful service achieve performance close to the traditional mode after containerization? I will briefly explain Kubernetes' solutions in terms of networking and storage.

Scalable network solutions

First is the extensible, plug-in based network solution. Thanks to Google's years of experience and lessons operating containers with Borg, in Kubernetes' network model each Pod has an independent IP, Pods can communicate across hosts without NAT, and Pods can also communicate with Node nodes. This excellent network model is compatible with the network solutions of traditional physical machine and virtual machine businesses, making migration to Kubernetes simpler. Most importantly, Kubernetes provides the open CNI network plug-in standard, which describes how to assign an IP to a Pod and achieve cross-node Pod connectivity. Every open-source project and cloud vendor can implement a high-performance, low-latency CNI plug-in based on its own business scenarios and underlying network, and finally achieve cross-node container communication.

The various Kubernetes network solutions implemented on top of CNI can be divided into underlay and overlay according to how packets are sent and received: the former interconnects directly on top of the underlying network and performs well, while the latter builds a virtual network on top of the underlying network using tunneling technology, so there is a certain performance loss.

Here I take the open-source Flannel and the TKE cluster network solutions as examples to elaborate on their approaches, advantages, and disadvantages.

In Flannel, the backend supports udp, vxlan, host-gw, and other forwarding modes. The udp and vxlan modes are overlay tunnel forwarding modes: they encapsulate the original request into UDP or VXLAN packets and forward them to the destination container over the underlay network. udp encapsulates packets in user space, so its performance is poor and it is mainly used for debugging or for kernel versions that do not support the VXLAN protocol. vxlan encapsulates packets in kernel space, reducing the performance loss. In host-gw mode, the IP routing information of each subnet is delivered directly to each node, so inter-host Pod communication requires no packet encapsulation at all; compared with udp and vxlan, host-gw has the best performance, but it requires layer-2 connectivity between the host nodes.
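
For reference, Flannel's backend is chosen in its net-conf.json, delivered as a ConfigMap in the standard kube-flannel manifest; a sketch with the vxlan backend (the network CIDR is the common default and is illustrative here):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
data:
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }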

The TKE cluster network solution also supports multiple network communication modes, having evolved through three modes: Global Route, VPC-CNI, and Pod with a dedicated elastic network card. Global Route means global routing: when a node joins the cluster, it is assigned a unique Pod CIDR, and TKE delivers a global route to the host machines in the user VPC through the VPC interface. When the container IP accessed from the user VPC by a node or container belongs to that Pod CIDR, the global routing rule is matched and the packet is forwarded to the target node. In this solution, the Pod CIDR is not a VPC resource, so the Pod is not a first-class citizen of the VPC and cannot use VPC features such as security groups; however, the solution is simple, requires no packet encapsulation or decapsulation at the user VPC layer, and causes no significant performance loss.

The TKE cluster's VPC-CNI network mode solves the problem that a series of VPC features cannot be used because the container Pod IP is not a first-class citizen of the VPC. The Pod IP comes from a subnet of the user VPC, and cross-node container communication and node-to-container communication work the same way as communication between CVM nodes in a VPC. At the bottom layer, routing and forwarding are implemented based on GRE tunnels in the VPC, and within a node packets are forwarded to the target container using policy-based routing (PBR). Based on this solution, container Pod IPs can enjoy VPC features and realize a series of advanced capabilities, such as CLB direct connection to Pods and fixed IP addresses.

Recently, to meet the container network performance requirements of gaming and storage services, the TKE team introduced the next-generation network solution: the VPC-CNI mode with a dedicated elastic network card per Pod, which no longer goes through the node's network protocol stack, greatly shortening the container access path and latency and enabling PPS to reach the limit of the whole machine. Based on this scheme we have implemented Pod binding to EIP/NAT, so that Pods no longer depend on the node's public network access capability, and we support binding security groups to Pods to achieve Pod-level security isolation. You can read the related article at the end of this article for details.

Based on Kubernetes' extensible network model, businesses can implement high-performance network plug-ins for specific scenarios. For example, Tencent's internal TenC platform implements the sriov-cni plug-in based on SR-IOV technology, which provides Kubernetes with a high-performance layer-2 VLAN network solution, especially for scenarios with high network performance requirements, such as distributed machine learning training and game back-end services.

Scalable storage solutions

Having introduced the scalable network solutions, the other core requirement of stateful services is high-performance storage I/O. In traditional deployment modes, stateful services generally use local disks; based on the service type, specifications, and external SLA, HDDs or SSDs are chosen. So how can the storage requirements of different scenarios be met on Kubernetes?

In the Kubernetes storage system, this problem is decomposed into several sub-problems that are solved elegantly, with good scalability and maintainability. Whether it is local disks, cloud disks, or file systems such as NFS, the corresponding plug-ins can be implemented based on its extension points, and the responsibilities of development and operations are separated.

So how is Kubernetes’ storage architecture built?

I will introduce Kubernetes' extensible storage system by walking through how to mount a data storage disk for your stateful Pod. It can be broken down into the following questions:

  • How does an application request a storage disk? (Consumer)

  • How does the Kubernetes storage system describe a disk? Are storage disks created manually, or automatically on demand? (Producer)

  • How are the disks in the storage resource pool matched with the requests of storage disk applicants? (Controller)

  • How do I describe the storage disk type, the data reclaim policy, and the service provider of this type of disk? (StorageClass)

  • How is the storage data volume plug-in implemented? (FlexVolume, CSI)

First of all, Kubernetes provides a resource named PVC (PersistentVolumeClaim), which describes the storage class, access mode, and capacity of the disk the application requests. For example, if you want to request a 100 GB cloud disk of the cbs storage class for your etcd service, you can create a PVC like the following.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-data          # illustrative claim name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: cbs

Secondly, Kubernetes provides a resource named PV (PersistentVolume), which describes the storage disk's type, access mode, and capacity and corresponds to a real disk. It supports two modes: manual creation and automatic creation. The following example is a 100 GB CBS disk.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: cbs-pv-test        # illustrative volume name
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  persistentVolumeReclaimPolicy: Delete
  qcloudCbs:
    cbsDiskId: disk-r1z73a3x
  storageClassName: cbs
  volumeMode: Filesystem

Then, when the application creates a PVC resource, the Kubernetes controller tries to match it with a PV: the storage class must be consistent and the PV capacity must satisfy the PVC's request. If the match succeeds, the PV's status becomes Bound. The controller further attaches the storage resource behind the PV to the node where the application Pod is scheduled, and after the attach succeeds, the kubelet component on that node mounts the disk to the corresponding data directory for reads and writes.

That is how a disk is requested and used in a container. But how do PV/PVC support multiple types of block storage and network file systems? For example, block storage services include ordinary HDD cloud disks, high-performance SSDs, and local SSDs, and remote network file systems include NFS. And how does the Kubernetes controller dynamically create PVs on demand?

To support multiple types of storage, Kubernetes provides the StorageClass resource to describe a storage class. It describes the storage disk type, the binding and reclaim policies, and which service component provisions the resources. For example, high-performance and basic MySQL services rely on different types of storage disks; you only need to fill in the corresponding StorageClass name when creating the PVC.

Finally, to support the open-source community and cloud vendors, Kubernetes provides extension mechanisms for storage data volumes: from the early in-tree built-in data volumes, to the FlexVolume plug-in, and now the GA CSI container storage plug-in mechanism. Storage service providers can integrate any storage system into Kubernetes through them. For example, the provisioner for the cbs StorageClass below is implemented by the Tencent Cloud TKE team; based on the Kubernetes FlexVolume/CSI extension mechanisms, we create and delete CBS disks through the Tencent Cloud CBS API.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cbs
parameters:
  type: cbs
provisioner: cloud.tencent.com/qcloud-cbs
reclaimPolicy: Delete
volumeBindingMode: Immediate

On top of the PV/PVC storage system introduced above, Kubernetes implements the local PV mechanism, which avoids network I/O overhead and gives your service higher disk I/O read/write performance. The core of local PV is to abstract local disks and LVM partitions into PVs; a Pod that uses a local PV relies on the delayed binding feature to be scheduled accurately onto the target node.
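
A static local PV setup, as a sketch (the storage class name, node name, and mount path are illustrative assumptions), combines a no-provisioner StorageClass with delayed binding and a node-pinned PV:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd                          # illustrative name
provisioner: kubernetes.io/no-provisioner  # static provisioning, no dynamic provisioner
volumeBindingMode: WaitForFirstConsumer    # delayed binding: bind only after the Pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node1                     # illustrative name
spec:
  capacity:
    storage: 200Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/disks/ssd0                  # local disk or LVM mount point on the node
  nodeAffinity:                            # pins the PV to the node that owns the disk
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-1"]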

The key technical points of local PV are capacity isolation (LVM, XFS quota), I/O isolation (cgroup v1 generally requires a customized kernel; cgroup v2 supports buffered I/O isolation, etc.), and dynamic provisioning. To solve some or all of these pain points, a series of open-source projects have emerged in the community, such as TopoLVM (supporting dynamic provisioning and LVM) and sig-storage-local-static-provisioner, and cloud vendors such as Tencent have also developed their own local PV solutions. In general, local PV is suitable for disk-I/O-sensitive storage services such as etcd, MySQL, and TiKV; for example, PingCAP's TiDB project recommends using local PV in production environments.

The disadvantages of local PV are that after a node fails, the data cannot be accessed and may be lost, and the volume cannot be expanded vertically (it is limited by the node's disk capacity). This places higher demands on the stateful service itself and its operator: the service needs to use a master/standby replication protocol or a consensus algorithm to guarantee data safety, and when any node fails, the operator must be able to scale out a new node and restore its data from a cold backup or a leader snapshot. For example, when a TiKV instance of TiDB becomes abnormal, a new instance is automatically added and the data is replicated through the Raft protocol.

Chaos engineering

Through the technical solutions above, we have addressed a series of pain points around workload type selection, custom resource extension, scheduling, high availability, high-performance networking, and high-performance storage, and we can now build stable, highly available, and elastically scalable stateful services on Kubernetes.

So how do you verify the stability of containerized stateful services?

The community provides several Kubernetes-based chaos engineering open-source projects, such as PingCAP's Chaos Mesh, which offers fault injection such as PodChaos, NetworkChaos, and IOChaos. Based on Chaos Mesh, you can quickly inject Pod failures, disk I/O and network I/O anomalies, and other faults into any Pod in a cluster, helping you quickly discover bugs in your stateful service and operator and verify the stability of the cluster. In the TKE team, we investigated and reproduced etcd bugs based on Chaos Mesh and stress-tested the stability of etcd clusters, which greatly reduced the difficulty of reproducing complex bugs and helped us improve the stability of the etcd kernel.
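
As a sketch (field names follow the chaos-mesh.org/v1alpha1 API and may differ slightly between Chaos Mesh versions; the namespace and labels are illustrative assumptions), a PodChaos experiment that kills one Redis Pod and lets the operator self-heal looks like this:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: redis-pod-kill            # illustrative name
spec:
  action: pod-kill                # kill a Pod and verify the workload/operator self-heals
  mode: one                       # pick one matching Pod at random
  selector:
    namespaces:
    - default                     # illustrative namespace
    labelSelectors:
      app: codis-redis            # illustrative label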

Conclusion

This article introduced how to describe and deploy your stateful service on Kubernetes by choosing the appropriate workload and extension mechanism for each component of a stateful cluster. Because of their particularity, data safety, high availability, and high performance are the core goals of stateful services. High availability can be achieved through scheduling and HA services: with Kubernetes' various scheduler extension mechanisms, you can deploy the equivalent Pods of your stateful service across fault domains, and with master/standby switchover services and consensus algorithms, the standby node can be automatically promoted to master when the master fails. High performance mainly depends on network and storage performance; Kubernetes provides the CNI network model, the PV/PVC storage system, and the CSI extension mechanism to meet customized requirements in various business scenarios. Finally, the application of chaos engineering to stateful services was introduced: through chaos engineering you can simulate the fault tolerance of your stateful service in all kinds of abnormal scenarios, helping you verify and improve the stability of the system.

References
  • Custom Resources
  • Extending Kubernetes
  • tidb-operator
  • Tencent Cloud container service TKE launches a new generation of zero-loss container network
  • Chaos Mesh