Scaling Kubernetes to 2,500 Nodes: Problems and Solutions

Kubernetes has claimed to support clusters of 5,000+ nodes since version 1.6, but problems inevitably arise on the way from tens of nodes to 5,000.

This article shares OpenAI's experience on the journey toward 5,000 nodes on Kubernetes: the problems encountered, the attempted fixes, and the root causes eventually identified.

Problems encountered and how they were solved

Problem 1: 1 to 500 nodes

Problem:

kubectl ran into problems (tip: use kubectl -v=6 to see verbose request logs)

Attempted fixes:

  • At first we suspected the kube-apiserver was overloaded, so we added more apiserver replicas behind a proxy for load balancing
  • However, after growing past 10 masters it was clear that kube-apiserver load was not the problem; for comparison, GKE serves 500-node clusters from a single 32-core VM

Root cause:

  • Having ruled out apiserver load, we examined the remaining services on the masters (etcd, kube-proxy)
  • We started tuning etcd
  • Datadog showed etcd throughput dropping, with latency spiking to ~100 ms
  • Benchmarking the disk with the fio tool (a sketch of such a run follows this list) showed etcd was using only about 10% of the available IOPS, held back by a write latency of ~2 ms
  • We switched etcd from network-attached drives to a local temp drive (local SSD) on each machine
  • Latency dropped from ~100 ms to around 200 µs
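
A minimal sketch of such a fio run, approximating etcd's small, fdatasync-heavy sequential write pattern (the directory, block size, and file size here are placeholders, not the exact benchmark used):

# Hypothetical fio benchmark: small sequential writes, fsynced after each
# write, similar to etcd's WAL. Watch the sync latency percentiles in the output.
fio --name=etcd-wal-sim --directory=/var/lib/etcd-bench \
    --rw=write --ioengine=sync --fdatasync=1 \
    --bs=2300 --size=100m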

Problem 2: ~1,000 nodes

Problem:

  • Found that kube-apiserver was reading 500 MB/s from etcd

Attempted fixes:

  • Used Prometheus to inspect network traffic between containers (an example of the kind of query is sketched below)
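
A hedged example of querying Prometheus over its HTTP API for per-pod transmit throughput. The metric comes from cAdvisor (which kubelet exposes), but the label name (pod_name here) varies across versions, and the server URL is a placeholder, so treat this as an illustration only:

# Hypothetical query: which pods transmit the most bytes per second?
curl -G 'http://prometheus.example.com/api/v1/query' \
     --data-urlencode 'query=topk(10, sum by (pod_name) (rate(container_network_transmit_bytes_total[5m])))'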

Root cause:

  • Fluentd and Datadog were polling the apiserver from every node far too frequently
  • After lowering the polling frequency of the two services, apiserver-to-etcd traffic dropped from 500 MB/s to almost nothing
  • etcd tip: with the --etcd-servers-overrides flag, Kubernetes Event data can be split off and handled by a dedicated set of etcd machines, as shown below
--etcd-servers-overrides=/events#https://0.example.com:2381;https://1.example.com:2381;https://2.example.com:2381

Problem 3: 1,000 to 2,000 nodes

Problem:

  • etcd could no longer accept writes, reporting errors about a cascading failure
  • kubernetes-ec2-autoscaler reported the problem only after all the etcd instances had stopped, and then shut all of them down

Attempted fixes:

  • etcd reported its disk as full, yet the SSD still had plenty of free space
  • We checked for a preset storage limit and found etcd's default 2 GB quota (a way to confirm this is sketched after this list)
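
A hedged way to confirm the quota has been hit, using etcdctl's v3 API (the endpoint is a placeholder):

# List active alarms; "alarm:NOSPACE" means the backend quota is exhausted.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 alarm list
# Show the database size for each endpoint.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 endpoint status -w table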

Solutions:

  • Raise the limit with the etcd startup flag --quota-backend-bytes (see the sketch after this list)
  • Modify the kubernetes-ec2-autoscaler logic: when more than 50% of the cluster shows problems, treat it as a cluster-level fault instead of shutting all the nodes down
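
A minimal sketch of raising the quota at etcd startup (the name, path, and value are examples; 8 GiB shown here):

# Hypothetical etcd invocation: raise the backend quota from the 2 GiB
# default to 8 GiB. All other flags are omitted for brevity.
etcd --name etcd0 \
     --data-dir /var/lib/etcd \
     --quota-backend-bytes=8589934592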

Optimization of various services

High availability of Kube Masters

In general, the architecture is one Kube master (hosting the main Kubernetes control-plane components: kube-apiserver, kube-scheduler, and kube-controller-manager) plus multiple worker nodes. To achieve high availability, consider the following:

  • Run multiple kube-apiserver instances, restarting each with the --apiserver-count flag set to the number of replicas
  • kubernetes-ec2-autoscaler can automatically shut down idle resources, which runs against the default Kubernetes scheduler behavior of spreading pods out; the policy below packs pods together instead:
{
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
  {"name" : "GeneralPredicates"},
  {"name" : "MatchInterPodAffinity"},
  {"name" : "NoDiskConflict"},
  {"name" : "NoVolumeZoneConflict"},
  {"name" : "PodToleratesNodeTaints"}
  ],
"priorities" : [
  {"name" : "MostRequestedPriority", "weight" : 1},
  {"name" : "InterPodAffinityPriority", "weight" : 2}
  ]
}

The example above adjusts the Kubernetes scheduler by raising the weight of InterPodAffinityPriority to achieve this goal; see the references for more examples.

Note that the Kubernetes scheduler policy currently does not support dynamic switching; kube-apiserver must be restarted (Issue 41600).
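
For reference, a minimal sketch of how these pieces might be launched on a control-plane node of that era (all values are placeholders, and every other required flag is omitted; --policy-config-file is the legacy scheduler flag, since superseded by scheduler configuration):

# Hypothetical invocations for illustration only.
kube-apiserver --apiserver-count=3
kube-scheduler --policy-config-file=/etc/kubernetes/scheduler-policy.json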

The impact of adjusting the scheduler policy

OpenAI used KubeDNS, but soon ran into problems.

Problem:

  • DNS queries failed frequently (at random)
  • Domain lookups exceeded ~200 QPS

Attempted fixes:

  • Investigated why more than 10 KubeDNS replicas were running on some nodes (e.g. with the command sketched below)
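
One hedged way to see how KubeDNS replicas are spread across nodes (the namespace and label follow the stock kube-dns deployment):

# -o wide adds the NODE column, revealing pods piled onto the same node.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide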

Solutions:

  • Because of the scheduler policy, many pods were being packed onto the same nodes
  • KubeDNS replicas are lightweight and easily ended up on the same node, concentrating all the domain lookups there
  • Pod anti-affinity had to be added to spread KubeDNS across different nodes as much as possible, as shown below
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: k8s-app
          operator: In
          values:
          - kube-dns
      topologyKey: kubernetes.io/hostname

Slow Docker image pulls when creating new nodes

Problem:

  • Every time a new node was created, pulling the Docker images took 30 minutes

Attempted fixes:

  • One very large container image (the Dota image, about 17 GB) held up image pulls for the whole node
  • We checked kubelet for additional image-pull options

Solutions:

  • Add the kubelet flag --serialize-image-pulls=false so that images are pulled in parallel and other services can start pulling sooner (see the kubelet startup options)
  • This option requires the Docker storage driver to be switched to overlay2
  • In addition, storing Docker images on SSD makes image pulls faster

Addendum: the relevant source comment

// serializeImagePulls when enabled, tells the Kubelet to pull images one
// at a time. We recommend *not* changing the default value on nodes that
// run Docker daemon with version < 1.9 or an Aufs storage backend.
// Issue #10959 has more details.
SerializeImagePulls *bool `json:"serializeImagePulls"`

Improving Docker image pull speed

In addition, pull speed can be increased in the following ways (a combined sketch follows the list):

  • kubelet: --image-pull-progress-deadline=30m
  • Docker daemon: max-concurrent-downloads=10
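
A hedged sketch of applying both settings on a typical systemd host (the paths and restart command are assumptions; max-concurrent-downloads is Docker's daemon-level option):

# kubelet: allow slow pulls of very large images to run for up to 30 minutes
# (all other kubelet flags omitted for brevity).
kubelet --image-pull-progress-deadline=30m
# Docker: download up to 10 layers concurrently, then restart the daemon.
cat <<'EOF' > /etc/docker/daemon.json
{
  "max-concurrent-downloads": 10
}
EOF
systemctl restart docker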

Network performance improvements

Flannel performance limitations

Network traffic between OpenAI's nodes can reach 10 to 15 Gbit/s, but with Flannel in the path it dropped to about 2 Gbit/s.

The solution was to remove Flannel and use the host network directly, via two pod-spec fields (a sketch follows the list below):

  • hostNetwork: true
  • dnsPolicy: ClusterFirstWithHostNet
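
A minimal sketch of a pod spec using these fields (the pod name and image are placeholders):

# Hypothetical pod that runs directly on the node's network.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: high-throughput-worker
spec:
  hostNetwork: true                   # bypass the overlay network entirely
  dnsPolicy: ClusterFirstWithHostNet  # keep cluster DNS resolution working
  containers:
  - name: worker
    image: example.com/worker:latest
EOF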

There are a few more caveats worth reading about in detail.


Rainbond is an application-centric open source PaaS that deeply integrates kubernetes-based container management, Service Mesh microservice architecture best practices, multi-type CI/CD application construction and delivery, multi-data center resource management and other technologies. To provide users with cloud native application life-cycle solutions, build an ecosystem of interconnection between applications and infrastructure, application to application, and infrastructure to meet the requirements of agile development, efficient operation, and lean management required to support rapid business development.