If you write Go, you know that the goroutine scheduler uses the classic GPM model. In short, Go tries to create a number of Ps (logical processors) matching the number of CPUs in the system. The advantage of keeping these two numbers equal is that every CPU stays busy and system resources are used to the fullest: too few Ps leaves CPUs idle, while too many Ps wastes CPU time on scheduling overhead.

Go provides us with the following function to configure this value. Generally we do not need to set it manually, but there are circumstances in which manual configuration is necessary, as will be mentioned later in the article:

runtime.GOMAXPROCS(n int)

If we don’t set the number of Ps manually, the Go program automatically sets this value at startup by getting the number of available CPUs in the system. So, back to the topic of this article: do you know how Go gets the number of available CPUs?

Does it read /proc/cpuinfo? Or use lscpu, or sysconf?

Conclusion

In case you don’t want to read further: on Linux, Go directly issues the sched_getaffinity() system call from assembly, which gets the CPU count available to the current process. The general flow is as follows:

If you are interested in other platforms, you may wish to explore the source yourself; Go has a different implementation for each platform.

Since it all comes down to a system call, this gives us a language-independent way to get the number of CPUs in the system. So I want to take a closer look at how Go uses this system call to get the total CPU count. I also want to make clear up front that the value obtained this way is inaccurate on container platforms such as Kubernetes! (See the second half of this article.)

Here we focus on analyzing the specific code logic of the last three steps of the above process.

The getproccount function

Let’s start with the entry point, the getproccount() function. There is not much code, so here it is in full:

func getproccount() int32 {
	const maxCPUs = 64 * 1024
	var buf [maxCPUs / 8]byte
	r := sched_getaffinity(0, unsafe.Sizeof(buf), &buf[0])
	if r < 0 {
		return 1
	}
	n := int32(0)
	for _, v := range buf[:r] {
		for v != 0 {
			n += int32(v & 1)
			v >>= 1
		}
	}
	if n == 0 {
		n = 1
	}
	return n
}

As you can see from the code above, getproccount() calls another function, sched_getaffinity(), to get a data set, performs some bit arithmetic on it, and returns the final value n. That n is the number of CPUs, and that is the whole of getproccount(). But to understand why the code looks like this, we need to keep digging.

About the sched_getaffinity function

Tracing into the sched_getaffinity() function, I found that it is implemented directly in assembly (sys_linux_amd64.s):

TEXT runtime·sched_getaffinity(SB),NOSPLIT,$0
	MOVQ	pid+0(FP), DI
	MOVQ	len+8(FP), SI
	MOVQ	buf+16(FP), DX
	MOVL	$SYS_sched_getaffinity, AX
	SYSCALL
	MOVL	AX, ret+24(FP)
	RET

The assembly code is not complicated; at its core it issues a system call: MOVL $SYS_sched_getaffinity, AX. Once we understand this system call, the logic of getproccount() falls into place. Let’s focus on SYS_sched_getaffinity.
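As a peek ahead, we can issue the same system call ourselves through Go's syscall package instead of runtime assembly. This is a Linux-only sketch (syscall.SYS_SCHED_GETAFFINITY and RawSyscall are Linux-specific), with the bit counting we will walk through below:

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// cpuCount issues sched_getaffinity directly and counts the set bits,
// mirroring what the Go runtime does in assembly (Linux-only sketch).
func cpuCount() int {
	var buf [1024]byte
	r, _, errno := syscall.RawSyscall(syscall.SYS_SCHED_GETAFFINITY,
		0,                                // pid 0 means the current process
		uintptr(len(buf)),                // size of the mask buffer
		uintptr(unsafe.Pointer(&buf[0]))) // where the kernel writes the mask
	if errno != 0 {
		return 1 // fall back to 1, as the runtime does
	}
	// r is the number of mask bytes the kernel filled in.
	n := 0
	for _, v := range buf[:r] {
		for v != 0 {
			n += int(v & 1)
			v >>= 1
		}
	}
	return n
}

func main() {
	fmt.Println("cpu total:", cpuCount())
}
```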

SYS_sched_getaffinity and CPU affinity

Before moving on to the SYS_sched_getaffinity function, let’s talk about another topic: CPU affinity. This word is also in the name of the function we are analyzing, so it must have something to do with it.

This article won’t go into the details of CPU affinity; for our purposes, two points are enough:

  1. By default, a process has affinity for all CPUs in the system.
  2. Linux provides interface functions to read and manipulate CPU affinity:
// include/linux/syscalls.h

// Set the affinity mask
long sys_sched_setaffinity(pid_t pid, unsigned int len,
                           unsigned long __user *user_mask_ptr);

// Get the affinity mask
long sys_sched_getaffinity(pid_t pid, unsigned int len,
                           unsigned long __user *user_mask_ptr);

By default, then, a server process can use all of the CPU resources on the system without any setup. Turning that around: can we use this interface to work out how many CPUs are available to a process?

Well, that’s exactly what Go does. The sys_sched_getaffinity function retrieves a bitmask of the CPUs available to the current process, with one bit per logical CPU: 1 means the CPU may be used, 0 means it may not. For example, on a 4-core system the mask is 1111 by default. If the system also uses hyperthreading, there are eight logical CPUs and the mask is 11111111.

Then we just count the number of 1s in the returned mask to know how many CPUs we can use. To receive this bitmask, Go initializes a byte array:

const maxCPUs = 64 * 1024
var buf [maxCPUs / 8]byte

To store the bitmask for any machine, the initial buffer must be large enough. After the call, buf holds the bitmask as an array of bytes; rendered in binary, it looks like this:

[00001111 00000000 00000000 ...] // 4 logical CPUs

With the bitmask in hand, let’s look back at the counting loop in getproccount():

func getproccount() int32 {
	// ...
	for _, v := range buf[:r] {
		for v != 0 {
			n += int32(v & 1) // if the lowest bit is 1, count it
			v >>= 1           // shift right to examine the next bit
		}
	}
	// ...
	return n
}
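The counting loop can be lifted into a standalone helper to see it in isolation. Note that countOnes is our own name for this sketch, not a runtime function:

```go
package main

import "fmt"

// countOnes counts the set bits across a byte slice, exactly like the
// inner loop of getproccount.
func countOnes(buf []byte) int32 {
	n := int32(0)
	for _, v := range buf {
		for v != 0 {
			n += int32(v & 1)
			v >>= 1
		}
	}
	return n
}

func main() {
	// 00001111 -> 4 logical CPUs
	fmt.Println(countOnes([]byte{0x0F}))
	// 11111111 00000011 -> 10 logical CPUs
	fmt.Println(countOnes([]byte{0xFF, 0x03}))
}
```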

Following this thread, we can implement the same CPU count directly in C (and the approach extends to other languages):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

void print_affinity(void) {
    cpu_set_t mask;
    long n = 0;
    if (sched_getaffinity(0, sizeof(cpu_set_t), &mask) == -1) {
        perror("sched_getaffinity");
        return;
    }

    int r = sizeof(mask.__bits) / sizeof(mask.__bits[0]);
    for (int i = 0; i < r; ++i) {
        unsigned long v = mask.__bits[i];
        while (v != 0) {
            n += v & 1;
            v >>= 1;
        }
    }
    printf("cpu total: %ld \n", n);
}

int main(void) {
    print_affinity();
    return 0;
}

Running it gives the same result:

cpu total: 4

Of course, C offers better-encapsulated ways to do this (glibc provides the CPU_COUNT() macro, for example), but we won’t expand on that here.

Problems in containers

Having analyzed how the CPU count works, we know the trick Go uses to get the number of available CPUs in the system. Unfortunately, in a container environment this value is wrong!

If you don’t believe me, try it yourself. Say your node has 40 cores and your pod is actually limited to 4. With Go’s defaults you get 40 instead of 4, which you can verify with runtime.NumCPU():

fmt.Printf("cpu total: %d \n", runtime.NumCPU()) // cpu total: 40

In other words, the host has 40 CPUs, but only 4 cores’ worth of quota is actually available, so Go creates far more Ps than the resources can support. This causes Ps to wait and work to be reassigned, and at the program level you will find that some goroutine executions stall. If it’s a web service, some client responses will be inexplicably slower than others, and you just won’t know what the problem is (I experienced this firsthand).

How to solve it?

So how do we solve this problem? We need one more piece of background:

We know that CPU resource allocation for containers is implemented through the cgroup mechanism: the CPU, memory, and other resources used by a pod are allocated by cgroups in the form of configuration files, stored under the corresponding subdirectory of /sys/fs/cgroup/cpu/.

# files under /sys/fs/cgroup/cpu/
cpu.cfs_period_us
cpu.cfs_quota_us

CPU can be allocated in either relative or absolute terms, and there are many configuration files; here we take the absolute (CFS) pair as an example.

cpu.cfs_period_us

This parameter sets the interval at which a cgroup’s access to CPU resources is reallocated, in microseconds (µs, written “us” here). For example, if tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 1000000. The upper limit of cpu.cfs_period_us is 1 second and the lower limit is 1000 microseconds.

cpu.cfs_quota_us

This parameter sets the total amount of time for which all tasks in a cgroup can run during one period (defined by cpu.cfs_period_us), in microseconds (µs, written “us” here). If you want a process to be able to use two full CPUs, set cpu.cfs_quota_us to 2000000 and cpu.cfs_period_us to 1000000.

— access.redhat.com/documentati…

If the details don’t stick, just remember the conclusion: we can calculate the actual number of CPUs a pod can use from these two parameters:

Number of logical CPUs = cpu.cfs_quota_us / cpu.cfs_period_us
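A sketch of that calculation in Go. Note that readCgroupInt and cfsCPUs are hypothetical helper names, the cgroup v1 file paths are assumed, and a quota of -1 conventionally means "no limit set":

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readCgroupInt reads a single integer from a cgroup file
// (hypothetical helper).
func readCgroupInt(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

// cfsCPUs converts a CFS quota/period pair into a CPU count.
// A non-positive quota means no limit is configured.
func cfsCPUs(quota, period int64) int {
	if quota <= 0 || period <= 0 {
		return 0 // no limit configured
	}
	n := int(quota / period)
	if n < 1 {
		n = 1 // never drop below one P
	}
	return n
}

func main() {
	quota, err1 := readCgroupInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	period, err2 := readCgroupInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if err1 != nil || err2 != nil {
		fmt.Println("no cgroup v1 CPU limit files found")
		return
	}
	fmt.Println("usable CPUs:", cfsCPUs(quota, period))
}
```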

Given this value, we can reset P at program initialization (usually in init) using the interface Go provides:

func init() {
  // n = cpu.cfs_quota_us / cpu.cfs_period_us
  runtime.GOMAXPROCS(n)
}

This is also the core logic of the Uber library at pkg.go.dev/go.uber.org… Of course, you can just use that library directly.