1. Borg System
Google's Borg system is a cluster manager that runs hundreds of thousands of tasks from thousands of different applications across multiple clusters, with up to tens of thousands of machines per cell. It achieves high utilization through admission control, efficient task packing, over-commitment, and process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time and scheduling policies that reduce the probability of correlated failures. In the Borg architecture, the Scheduler is responsible for task scheduling.
2. Basic introduction to K8S
As Docker container technology grew in popularity, it turned out to be difficult to apply Docker to real business deployments: orchestrating, managing, and scheduling containers was not easy. People therefore urgently needed a system to manage Docker and containers in a more advanced and flexible way. That is when K8S appeared.
K8S is a container-based cluster management platform. Its full name is Kubernetes.
3. Main functions of K8S
Kubernetes is a tool for scheduling and managing Docker containers. It is a container scheduling service built on Docker that provides resource scheduling, load balancing and disaster recovery, service registration, dynamic scaling, and related functions. Kubernetes also provides mechanisms for application deployment, maintenance, and scaling, which make it easy to manage containerized applications running across machines. Its main functions are as follows:
Data volumes: Data volumes can be used to share data between containers in pods.
Application health check: A service in a container may hang and stop processing requests. You can configure health check policies to keep the application healthy.
Application replicas: The controller maintains Pod replicas, ensuring that one Pod or a group of similar Pods is always available.
Elastic scaling: Automatically scales the number of Pod replicas according to a configured metric (such as CPU utilization).
Service discovery: Uses environment variables or the DNS add-on so that programs in a container can discover the access address of a Pod entry point.
Load balancing: A set of Pod replicas is assigned a private cluster IP address, and the load balancer forwards requests to the back-end containers. Other Pods within the cluster can access the application through this ClusterIP.
Rolling updates: Updates a service without interruption, one Pod at a time, rather than taking down the entire service at once.
Service orchestration: Describes the deployment in files, which makes application deployment more efficient.
Resource monitoring: The Node component integrates the cAdvisor resource collection tool. Heapster aggregates the resource data of the entire cluster and stores it in the InfluxDB time-series database, and Grafana displays it.
Authentication and authorization: Supports attribute-based access control (ABAC) and role-based access control (RBAC) policies.
4. K8s cluster architecture
A cluster consists of two main parts:
One Master node (the main node)
A group of Nodes (compute nodes)
The Master node is primarily responsible for administration and control. A Node is a workload node that runs the actual containers.
1) Master node
The Master node includes the API Server, Scheduler, Controller Manager, and etcd. The API Server is the external interface of the entire system and is called by clients and other components; it is the equivalent of a "service hall". The Scheduler is responsible for scheduling resources within the cluster and acts as a "dispatch room". The Controller Manager is responsible for managing the controllers.
2) Node
A Node includes Docker, kubelet, kube-proxy, Fluentd, kube-dns (optional), and Pods.
5. K8s master
5.1. API server
The Kubernetes API Server is the external interface of the entire system; clients and the other components all go through it.
etcd is a distributed, reliable key-value storage system used to store the critical data of a distributed system. This definition is important.
In Kubernetes, etcd is a third-party service that maintains the cluster state, such as Pod, Service, and other object information.
etcd is highly available, uses the Raft protocol as its consensus algorithm, and is implemented in Go. It is a very important component of a Kubernetes cluster, storing all of the cluster's network configuration and object state information. In the whole Kubernetes system, two services use etcd for coordination and configuration storage:
1) The network plug-in flannel, whose configuration information is stored in etcd
2) Kubernetes itself, including the state and metadata configuration of its various objects
5.4. Scheduler
The scheduler selects a Node for each newly created Pod based on a scheduling algorithm. It plays an important linking role between upstream and downstream in the whole system. Upstream, it receives the new Pods created by the Controller Manager and arranges a destination Node for each of them; downstream, once placement is complete, the kubelet service process on the target Node takes over the subsequent work.
In other words, the scheduler uses the scheduling algorithm to select the most suitable Node from the Node list for each Pod in the pending queue.
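The idea of picking "the most suitable Node" can be sketched as a filter step followed by a scoring step. This is only an illustration under stated assumptions: the `Node` and `Pod` classes, their fields, and the "most CPU left over" scoring rule are hypothetical and are not the real kube-scheduler policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu: float  # free CPU cores (hypothetical unit)
    free_mem: int    # free memory in MiB (hypothetical unit)


@dataclass
class Pod:
    name: str
    cpu: float
    mem: int


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Return the name of the chosen node, or None if no node fits."""
    # Filter step: keep only nodes with enough free resources.
    feasible = [n for n in nodes if n.free_cpu >= pod.cpu and n.free_mem >= pod.mem]
    if not feasible:
        return None
    # Score step: prefer the node with the most resources left after placement
    # (an illustrative rule, not the real scheduler's scoring).
    best = max(feasible, key=lambda n: (n.free_cpu - pod.cpu, n.free_mem - pod.mem))
    return best.name


nodes = [Node("node-a", 1.0, 2048), Node("node-b", 4.0, 8192), Node("node-c", 0.5, 512)]
print(schedule(Pod("web", 1.0, 1024), nodes))    # node-b has the most headroom
print(schedule(Pod("huge", 16.0, 1024), nodes))  # no node fits, so None
```

Separating filtering from scoring keeps "can this Pod run here at all" apart from "where does it run best", which is the distinction the text draws when it speaks of the most suitable Node.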
6. K8s node
The kubelet manages Pods and their containers, images, volumes, and so on.
The kubelet is the Master's agent on each Node. Every Node starts a kubelet process, which handles the tasks the Master sends to the Node and manages the life cycle of the containers running locally: for example, creating containers, mounting data volumes for Pods, downloading Secrets, and reporting container and node status. The kubelet turns each Pod into a set of containers.
1. By default, the kubelet listens on four ports: 10250, 10255, 10248, and 4194.
10250 (kubelet API): The port the kubelet server uses to communicate with the API server. It periodically asks the API server for the tasks it should process. Node resources and status can be accessed through this port.
10248 (health check port): Access this port to determine whether the kubelet is working properly. Use the kubelet startup parameters --healthz-port and --healthz-bind-address to specify the listening port and address.
4194 (cAdvisor): Through this port, the kubelet can obtain the environment information of the node and the status of the containers running on it. Visit http://localhost:4194 to see the cAdvisor management interface. The port can be set with the kubelet startup parameter --cadvisor-port.
10255 (read-only API): Provides Pod and Node information. The interface is exposed as read-only; no authentication or authorization is required to access this port.
6.2. kube-proxy
kube-proxy implements the Pod network proxy on each Node, maintaining network rules and layer-4 load balancing. It is essentially like a reverse proxy: the kube-proxy running on each node can be thought of as a transparent proxy and load balancer for Services.
kube-proxy watches Service and Endpoint information in the API server, configures iptables rules, and forwards requests directly to Pods through iptables.
Docker is the engine that runs the containers.
A Pod is the smallest deployment unit. It consists of one or more containers that share storage and network and run on the same Docker host.
1) Basic structure of a Pod
Pause containers are created on nodes, one for each Pod.
Each Pod runs a special container called Pause; the other containers are service containers. The service containers share the Pause container's network stack and mounted Volumes, so communication and data exchange between them are more efficient. We can take advantage of this at design time by putting a set of closely related service processes into the same Pod. Containers in the same Pod can communicate with each other simply by using localhost.
The Pause container in Kubernetes provides the following for each service container:
PID namespace: Different applications in a Pod can see each other's process IDs
Network namespace: Multiple containers in a Pod share the same IP address and port range
IPC namespace: Multiple containers in a Pod can communicate using System V IPC or POSIX message queues
UTS namespace: Multiple containers in a Pod share a host name
Volumes: Each container in a Pod can access Volumes defined at the Pod level
7. Other components
Ingress Controller: provides browser/server (HTTP) access to services in the K8S cluster
Federation: provides unified management of K8S across cluster centers
Prometheus: provides monitoring capability for K8S clusters
ELK: provides a unified log analysis platform for K8S clusters
II. Principles of core components
ReplicationController (RC) ensures that the number of replicas of a container application always matches the user-defined number: if a container exits unexpectedly, a new Pod is automatically created to replace it, and if there are more Pods than desired, the surplus is automatically reclaimed.
In newer versions of Kubernetes, ReplicaSet is recommended as a replacement for ReplicationController.
ReplicaSet is not fundamentally different from ReplicationController apart from its name, except that it also supports set-based selectors.
Although a ReplicaSet can be used on its own, it is generally recommended to let a Deployment manage ReplicaSets automatically, so that there is no need to worry about incompatibility with other mechanisms (for example, ReplicaSet does not support rolling updates but Deployment does).
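The replica-count guarantee described above can be pictured as a reconciliation pass that compares the desired count with what is running. The following is a minimal sketch under stated assumptions: `reconcile`, the `pod-N` naming, and the in-memory list are hypothetical stand-ins, not the controller's real implementation.

```python
import itertools

_pod_ids = itertools.count(1)  # hypothetical source of unique Pod names


def reconcile(desired: int, running: list) -> list:
    """Return the Pod list after one reconciliation pass."""
    pods = list(running)
    # Too few replicas: create new Pods until the count matches.
    while len(pods) < desired:
        pods.append(f"pod-{next(_pod_ids)}")
    # Too many replicas: reclaim the surplus.
    del pods[desired:]
    return pods


pods = reconcile(3, [])        # initial creation of three replicas
pods = pods[1:]                # simulate one Pod exiting unexpectedly
pods = reconcile(3, pods)      # a replacement is created
print(len(pods))
pods = reconcile(2, pods)      # desired count lowered, surplus reclaimed
print(len(pods))
```

Running the same pass repeatedly is harmless when nothing has changed, which is why a controller can simply loop on it.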
Deployment provides a declarative way to define Pods and ReplicaSets, replacing ReplicationController for easier application management.
Typical application scenarios:
(1) Define a Deployment to create Pods and a ReplicaSet
(2) Rolling updates and rollbacks of applications
(3) Scaling out and scaling in
(4) Pausing and resuming a Deployment
A Deployment supports not only rolling updates but also rollbacks: if the service turns out to be unavailable after upgrading to version v2, it can be rolled back to version v1.
Horizontal Pod Autoscaling applies only to Deployments and ReplicaSets. In the v1 version, scaling is supported only on Pod CPU utilization; in the v1alpha version, scaling on memory and user-defined metrics is also supported.
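To make CPU-based scaling concrete: the Kubernetes documentation describes the core of the calculation as desiredReplicas = ceil(currentReplicas × currentMetric / desiredMetric). The sketch below applies just that formula; it deliberately ignores tolerances, stabilization windows, and minimum/maximum replica bounds, and the function name and sample numbers are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: int,
                     target_utilization: int) -> int:
    """Core HPA ratio: scale the replica count by observed/target utilization."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


# 4 replicas at 90% CPU with a 60% target: scale out.
print(desired_replicas(4, 90, 60))
# 3 replicas at 20% CPU with a 60% target: scale in.
print(desired_replicas(3, 20, 60))
```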
StatefulSet is designed to solve the problem of stateful services (Deployments and ReplicaSets are designed for stateless services). Its application scenarios include:
(1) Stable persistent storage: a Pod can still access the same persistent data after it is rescheduled. This is implemented with PVCs.
(2) Stable network identity: after a Pod is rescheduled, its PodName and HostName remain unchanged. This is implemented with a Headless Service (that is, a Service without a Cluster IP).
(3) Ordered deployment and ordered scale-out: Pods are ordered, and deployment or scale-out follows the defined sequence (from 0 to N-1; all previous Pods must be Running and Ready before the next Pod runs). This is implemented with init containers.
(4) Ordered scale-in and ordered deletion (from N-1 to 0)
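The ordering rules in (3) and (4) can be illustrated with a small sketch. StatefulSet Pods are named with the set name plus an ordinal; the helper names and the example set name `web` below are hypothetical, and the sketch only produces the order rather than waiting for each Pod to become Running and Ready.

```python
from typing import List


def creation_order(statefulset: str, replicas: int) -> List[str]:
    """Pods are created one by one, from ordinal 0 up to N-1."""
    return [f"{statefulset}-{i}" for i in range(replicas)]


def deletion_order(statefulset: str, replicas: int) -> List[str]:
    """Pods are removed in the reverse order, from ordinal N-1 down to 0."""
    return list(reversed(creation_order(statefulset, replicas)))


print(creation_order("web", 3))
print(deletion_order("web", 3))
```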
DaemonSet ensures that all (or some) Nodes run a copy of a Pod. (A node can carry a taint, which can be thought of as a kind of label; if a Pod does not declare a toleration for that taint, the scheduler will not assign the Pod to that node.)
When nodes are added to the cluster, a Pod is added to them. When nodes are removed from the cluster, these Pods are reclaimed. Deleting a DaemonSet deletes all the Pods it created. Some typical uses of DaemonSet:
(1) Run cluster storage daemons on each Node, for example glusterd and ceph.
(2) Run log collection daemons on each Node, for example Fluentd and Logstash.
(3) Run monitoring daemons on each Node, for example Prometheus Node Exporter.
Job is responsible for batch tasks, that is, tasks that run only once, and it ensures that one or more Pods of the batch task terminate successfully.
CronJob manages time-based Jobs:
- Run once at a given point in time
- Run periodically at given points in time
A data volume shares data among the containers in a Pod.
Labels are used to distinguish objects (for example, Pods and Services) and exist as key/value pairs. Each object can have multiple labels, and labels are used to associate objects with each other.
Any API object in Kubernetes can be identified by Labels. A Label is essentially a key/value pair, where both the key and the value are specified by the user.
Labels can be attached to various resource objects, such as Nodes, Pods, Services, and RCs. A resource object can define any number of labels, and the same Label can be added to any number of resource objects.
Labels are the basis on which Replication Controllers and Services operate: both use labels to associate themselves with the Pods running on nodes.
We can bind one or more different labels to a resource object to manage resources in groups along multiple dimensions, which makes resource allocation, scheduling, configuration, and other management work flexible and convenient. Some commonly used labels are as follows:
Version labels: "release": "stable", "release": "canary", ...
Environment labels: "environment": "dev", "environment": "qa", "environment": "production"
Tier labels: "tier": "frontend", "tier": "backend", "tier": "middleware"
Partition labels: "partition": "customerA", "partition": "customerB"
Quality-control labels: "track": "daily", "track": "weekly"
A Label is the familiar tag. Defining a Label for a resource object is like attaching a tag to it; you can then use Label Selectors to query and filter the resource objects that carry certain labels. In this way, Kubernetes implements a simple and generic object query mechanism similar to SQL.
Important usage scenarios of Label Selectors in Kubernetes are as follows:
- The kube-controller process uses the Label Selector defined on an RC resource object to select the Pod replicas to monitor, so that the number of replicas is automatically kept at the expected value.
- The kube-proxy process selects the corresponding Pods through the Label Selector of a Service and automatically builds the request-forwarding routing table from each Service to its Pods, which implements the Service's intelligent load balancing.
- kube-scheduler can implement directed Pod scheduling: specific labels are defined on Nodes, and the Pod definition file uses nodeSelector to target them.
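The SQL-like filtering that Label Selectors provide can be sketched with plain dictionaries. This is a simplified model of equality-based selection only; the `matches` and `select` helpers and the sample Pod names are hypothetical, not part of any Kubernetes client library.

```python
from typing import Dict, List


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """An object matches when it carries every key/value pair in the selector."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(selector: Dict[str, str], objects: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the names of all objects whose labels satisfy the selector."""
    return [name for name, labels in objects.items() if matches(selector, labels)]


pods = {
    "web-1": {"tier": "frontend", "release": "stable"},
    "web-2": {"tier": "frontend", "release": "canary"},
    "db-1": {"tier": "backend", "release": "stable"},
}

print(select({"tier": "frontend"}, pods))
print(select({"tier": "frontend", "release": "canary"}, pods))
```

A Service or RC that declares the selector `{"tier": "frontend"}` would, in this model, be associated with both `web-1` and `web-2`.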
3. Service discovery
1. Service
1.1. What is Service
A Service is an abstraction. It exposes a specified port on a virtual IP (VIP), and requests from clients are proxied and forwarded to one of a set of back-end Pods (endpoints).
A Service defines a logical set of Pods and the policy for accessing that set; it is an abstraction of a real service. A Service provides a unified access entry point together with a proxy and discovery mechanism, and associates multiple Pods that carry the same Label. Users do not need to know how the back-end Pods run. Kubernetes has three kinds of IP address:
- Node IP: the IP address of a Node
- Pod IP: the IP address of a Pod
- Cluster IP: the IP address of a Service
First, the Node IP is the IP address of the physical network card of a node in the Kubernetes cluster. All servers on this network can communicate with each other directly through it. This also means that a node outside the Kubernetes cluster must go through a Node IP when it accesses a node or a TCP/IP service inside the cluster.
Second, the Pod IP is the IP address of each Pod. It is assigned by the Docker Engine from the IP address range of the docker0 bridge, which is usually a virtual layer-2 network.
Finally, the Cluster IP is a virtual IP, and it is closer to a fake IP network, for the following reasons:
- A Cluster IP is used only for a Kubernetes Service and is managed and assigned by Kubernetes. No "physical network object" responds to it.
- A Cluster IP can only form a concrete communication endpoint in combination with a Service Port. A Cluster IP on its own offers no basis for communication, and it belongs to the closed space of the Kubernetes cluster.
- Within the cluster, communication between the Node IP network, the Pod IP network, and the Cluster IP network uses special routing rules that Kubernetes sets up programmatically.
1.2. How a Service works
A VIP is implemented mainly through the ARP protocol of TCP/IP. An IP address is only a logical address, whereas a MAC address is the physical address used for data transmission on Ethernet. Each host has an ARP cache that stores the mapping between IP addresses and MAC addresses on the same network. Before sending data, an Ethernet host looks up the MAC address corresponding to the target IP address in this cache and sends the data to that MAC address. The operating system maintains this cache automatically, and it is the key to the entire implementation.
2. iptables
iptables mode is the default proxy mode for Services. In this mode, kube-proxy does not itself act as a reverse proxy that load-balances between VIPs and back-end Pods; that work is delegated to iptables, which operates at layer 4. iptables and netfilter are tightly integrated and work together to forward packets in kernel space.
In this mode, kube-proxy implements packet forwarding in the following steps:
- By watching the Kubernetes cluster API, it learns when Services or endpoint Pods are created or deleted.
- kube-proxy sets iptables rules on the node. When a request is sent to the ClusterIP of a Service, it is immediately captured and redirected to one of the Service's back-end Pods.
- kube-proxy sets iptables rules on the node for each Pod of a Service. By default, Pods are selected at random.
In iptables mode, kube-proxy delegates traffic forwarding and load-balancing policy entirely to iptables/netfilter. These forwarding operations happen in kernel space, which is much faster than userspace.
In iptables mode, kube-proxy only watches the API and synchronizes the latest data; the routing rules and the forwarding itself are handled by iptables and netfilter in kernel space. In userspace mode, kube-proxy does the load balancing itself, so if a back-end Pod does not respond, kube-proxy can retry with another one. In iptables mode, forwarding follows fixed routing rules: if the back-end Pod chosen for a request does not respond and has not been removed by K8S, requests forwarded to that Pod may time out. iptables mode must therefore be used together with K8S probes.
2.1. Load balancing
iptables on Linux supports two modes for TCP load balancing: random and round-robin.
The statistic module supports two different modes: random, in which a rule is skipped based on a probability, and nth, in which a rule is skipped based on a round-robin algorithm.
2.2. Random mode
An example of load balancing with iptables in random mode is as follows:
Suppose there are three servers in the system, and we configure iptables to balance the traffic that accesses them.
# Balancing traffic across the three servers
iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 -m statistic --mode random --probability 0.33 -j DNAT --to-destination 10.0.0.2:1234
iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 -m statistic --mode random --probability 0.5 -j DNAT --to-destination 10.0.0.3:1234
iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 -j DNAT --to-destination 10.0.0.4:1234
The first rule specifies --probability 0.33, which means it has a 33% probability of being hit.
The second rule also has a 33% probability of being hit, even though it specifies --probability 0.5: it is evaluated only when the first rule was not hit, so its overall probability is 50% * (1 - 33%) ≈ 0.33.
The third rule does not specify --probability, so any packet that reaches it is always hit. The probability of reaching the third rule is 1 - 0.33 - 0.33 ≈ 0.33.
As shown above, all three rules have the same chance of being hit. To change the hit ratio of the rules, adjust the --probability parameter.
Assuming that there are n servers, n rules can be set up to distribute traffic evenly across them, where the value of the --probability parameter is calculated with the following formula:
p = 1/(n − i + 1)
Here i is the index of the rule (the first rule has i = 1), n is the total number of rules/servers, and p is the value of --probability in rule i.
Note: iptables rules are matched sequentially, from top to bottom, so the rules must be strictly ordered when they are designed. The order of the three rules above cannot be changed; otherwise the load is no longer balanced evenly.
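The formula p = 1/(n − i + 1) and the ordering requirement can be checked with a few lines of exact arithmetic. The sketch below is only a model of sequential rule evaluation, with hypothetical helper names: it computes the per-rule --probability values and then the overall chance that each rule is the one that fires.

```python
from fractions import Fraction
from typing import List


def rule_probabilities(n: int) -> List[Fraction]:
    """--probability value for rule i (1-based) out of n rules: 1/(n - i + 1)."""
    return [Fraction(1, n - i + 1) for i in range(1, n + 1)]


def hit_probabilities(probs: List[Fraction]) -> List[Fraction]:
    """Overall chance that each rule fires when rules are tried top to bottom."""
    remaining = Fraction(1)  # probability that a packet reaches the current rule
    hits = []
    for p in probs:
        hits.append(remaining * p)
        remaining *= 1 - p
    return hits


probs = rule_probabilities(3)
print([str(p) for p in probs])                     # per-rule --probability values
print([str(h) for h in hit_probabilities(probs)])  # overall share of each server
# Reversing the rule order breaks the balance:
print([str(h) for h in hit_probabilities(list(reversed(probs)))])
```

With the rules reversed, the first rule has probability 1 and takes all the traffic, which is why the order matters.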
2.3. Round-robin mode
The round-robin (nth) algorithm has two parameters:
n: every n packets. p: the p-th packet.
In a rule, n and p mean: starting from the p-th packet, the rule is hit once every n packets.
This may be a bit hard to follow, so let's go straight to an example:
If there are three servers and traffic packets are distributed to them in round-robin fashion, the rules are configured as follows:
# every: the rule matches once for every n packets
# packet: start with the p-th packet
iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 -m statistic --mode nth --every 3 --packet 0 -j DNAT --to-destination 10.0.0.2:1234
iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 -m statistic --mode nth --every 2 --packet 0 -j DNAT --to-destination 10.0.0.3:1234
iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 -j DNAT --to-destination 10.0.0.4:1234
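Why `--every 3` followed by `--every 2` yields an even rotation is easier to see in a simulation. The sketch below is a simplified model, not the kernel implementation: it assumes each rule keeps its own counter of the packets that reach it and matches when that counter modulo `every` equals `packet`. The class and function names are hypothetical.

```python
from typing import List


class NthRule:
    """Simplified model of '-m statistic --mode nth --every N --packet P'."""

    def __init__(self, every: int, packet: int, target: str) -> None:
        self.every = every
        self.packet = packet
        self.target = target
        self.seen = 0  # packets that have reached this rule so far

    def matches(self) -> bool:
        hit = self.seen % self.every == self.packet
        self.seen += 1
        return hit


def route(rules: List[NthRule], default_target: str) -> str:
    """Try the rules top to bottom; fall through to the unconditional last rule."""
    for rule in rules:
        if rule.matches():
            return rule.target
    return default_target


rules = [NthRule(3, 0, "10.0.0.2:1234"), NthRule(2, 0, "10.0.0.3:1234")]
targets = [route(rules, "10.0.0.4:1234") for _ in range(6)]
print(targets)
```

The second rule only sees the two out of every three packets that the first rule let through, so matching every second one of those gives each server one packet in three.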
3.1. What is IPVS
IPVS (IP Virtual Server) implements transport-layer load balancing, commonly known as Layer 4 LAN switching, and is part of the Linux kernel.
IPVS runs on a host and acts as a load balancer in front of a cluster of real servers. IPVS directs requests for TCP- and UDP-based services to the real servers and makes the services of the real servers appear as virtual services on a single IP address.
3.2. IPVS vs. IPTABLES
IPVS mode was introduced in Kubernetes v1.8 and went into beta in v1.9. iptables mode was added in v1.1 and has been the default mode of operation since v1.2. Both IPVS and iptables are based on netfilter. The differences between IPVS mode and iptables mode are as follows:
- IPVS provides better scalability and performance for large clusters.
- IPVS supports more complex load balancing algorithms (least load, least connections, locality, weighted, etc.) than iptables.
- IPVS supports server health checks and connection retries.
iptables was designed for firewall services on Linux, and with a small number of rules it has little performance impact. A K8S cluster, however, can have thousands of Services, each of which forwards to Pods, with one iptables rule per Pod. Every node in such a cluster therefore carries a very large number of iptables rules, which is a nightmare.
IPVS also addresses large-scale network forwarding, but it has an advantage over iptables: it stores its forwarding rules in hash tables, and it works primarily in kernel space, which reduces the overhead of context switching.
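The scaling argument can be illustrated by comparing a sequential rule list with a hash-table lookup. This is a deliberately simplified model with hypothetical names and addresses: real iptables rules are organized into chains, and real IPVS tables live in the kernel, but the difference in lookup cost is the point being made.

```python
from typing import Dict, List, Optional, Tuple

Dest = Tuple[str, int]  # (virtual IP, port)


def linear_lookup(rules: List[Tuple[Dest, str]], dest: Dest) -> Tuple[Optional[str], int]:
    """iptables-style: walk the rule list until one matches; also count rules checked."""
    for checked, (rule_dest, backend) in enumerate(rules, start=1):
        if rule_dest == dest:
            return backend, checked
    return None, len(rules)


def hash_lookup(table: Dict[Dest, str], dest: Dest) -> Optional[str]:
    """IPVS-style: a single hash-table lookup, independent of the table size."""
    return table.get(dest)


# 5000 hypothetical Services, each mapped to one back end.
rules = [((f"10.96.{i // 256}.{i % 256}", 80), f"pod-{i}") for i in range(5000)]
table = dict(rules)

last_dest = rules[-1][0]
backend, checked = linear_lookup(rules, last_dest)
print(backend, checked)               # the last Service needs a full scan of the list
print(hash_lookup(table, last_dest))  # same answer from one lookup
```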
3.3. IPVS Load Steps
kube-proxy and IPVS configure network forwarding in the following steps:
- By watching the Kubernetes cluster API, kube-proxy learns when Services or endpoint Pods are created or deleted. When a new Service is created, it calls the network interface and builds the IPVS rules.
- kube-proxy also periodically synchronizes the forwarding rules of Services and back-end Pods, so that failed forwarding entries are updated and repaired.
- When a request is forwarded to the back-end cluster, the IPVS load balancer forwards it directly to a back-end Pod.
3.4. IPVS Load algorithm
IPVS supports several load balancing algorithms:
- rr: round-robin
- lc: least connections
- dh: destination hashing
- sh: source hashing
- sed: shortest expected delay
- nq: never queue
On a node, the --ipvs-scheduler parameter of kube-proxy specifies which of these algorithms is used.
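Two of the algorithms listed above, rr and lc, are simple enough to sketch in a few lines. This is an illustration of the selection logic only, with hypothetical function names and back-end labels; it does not reflect how the kernel's IPVS schedulers are implemented.

```python
import itertools
from typing import Dict, Iterator, List


def round_robin(backends: List[str]) -> Iterator[str]:
    """rr: hand out back ends in a fixed rotation."""
    return itertools.cycle(backends)


def least_connections(active: Dict[str, int]) -> str:
    """lc: pick the back end with the fewest active connections."""
    return min(active, key=active.get)


rr = round_robin(["pod-a", "pod-b", "pod-c"])
print([next(rr) for _ in range(4)])

connections = {"pod-a": 3, "pod-b": 1, "pod-c": 2}
print(least_connections(connections))
```

Round-robin ignores load entirely, while least-connections needs per-back-end state, which is one reason the choice of scheduler is left configurable.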