Linux namespaces are a kernel-level environment isolation mechanism provided by Linux. In official terms, a namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of that resource. It was the rise of container technology that put namespaces back on the radar.

Linux Namespaces have the following six types:

Type                 System call flag   Kernel version
Mount namespaces     CLONE_NEWNS        Linux 2.4.19
UTS namespaces       CLONE_NEWUTS       Linux 2.6.19
IPC namespaces       CLONE_NEWIPC       Linux 2.6.19
PID namespaces       CLONE_NEWPID       Linux 2.6.24
Network namespaces   CLONE_NEWNET       Started in Linux 2.6.24, completed in Linux 2.6.29
User namespaces      CLONE_NEWUSER      Started in Linux 2.6.23, completed in Linux 3.8

The namespace API consists of three system calls and a set of /proc files, all of which are covered in detail in this article. To specify the type of namespace to operate on, pass one of the CLONE_NEW* constants (CLONE_NEWIPC, CLONE_NEWNS, CLONE_NEWNET, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS) in the flags argument of the system call. Multiple constants can be combined with the bitwise OR operator (|).

Briefly, the three system calls do the following:

  • clone(): creates a new process (it is also the system call used to implement threads). Passing the CLONE_NEW* flags described above places the new process in newly created namespaces.
  • unshare(): moves the calling process out of a namespace it currently shares and into a newly created one.
  • setns(): moves the calling process into an existing namespace.

See below for details of the implementation principle.

1. clone()


clone() has the following prototype:

int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);

  • child_func: the function in which the child process starts executing.
  • child_stack: the stack space used by the child process.
  • flags: the CLONE_* flags to apply.
  • arg: the argument passed to child_func.

clone() is similar to fork() in that it makes a copy of the current process, but clone() gives finer control over which resources are shared with the child process. That control is exercised through flags and covers virtual memory, open file descriptors, semaphores, and more. When a CLONE_NEW* flag is specified, a namespace of the corresponding type is created and the new process becomes a member of it.

The clone() prototype above is not the lowest-level system call; it is a wrapper around it. In the kernel versions discussed here, the function that actually implements the system call is do_fork() (it has been renamed in later kernels), which has the following form:

long do_fork(unsigned long clone_flags,
          unsigned long stack_start,
          unsigned long stack_size,
          int __user *parent_tidptr,
          int __user *child_tidptr)

Where clone_flags can be assigned to the flags mentioned above.

Here’s an example:

/* demo_uts_namespaces.c

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   Demonstrate the operation of UTS namespaces.
*/
#define _GNU_SOURCE
#include <sys/wait.h>
#include <sys/utsname.h>
#include <sched.h>
#include <signal.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct utsname uts;

    /* Change hostname in new UTS namespace */

    if (sethostname(arg, strlen(arg)) == -1)
        errExit("sethostname");

    /* Retrieve and display the hostname */

    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in child:  %s\n", uts.nodename);

    /* Keep the namespace open for a while, by sleeping.
       This allows some experimentation--for example, another
       process might join the namespace. */

    sleep(100);

    return 0;           /* Terminates child */
}

#define STACK_SIZE (1024 * 1024)    /* Stack size for cloned child */

static char child_stack[STACK_SIZE];

int
main(int argc, char *argv[])
{
    pid_t child_pid;
    struct utsname uts;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Call clone() to create a child in a new UTS namespace, passing
       a start function and a stack. The stack grows downward, so we
       pass a pointer to its top. */

    child_pid = clone(childFunc,
                      child_stack + STACK_SIZE,   /* Top of the stack */
                      CLONE_NEWUTS | SIGCHLD, argv[1]);
    if (child_pid == -1)
        errExit("clone");
    printf("PID of child created by clone() is %ld\n", (long) child_pid);

    /* Parent falls through to here */

    sleep(1);           /* Give child time to change its hostname */

    /* Display the hostname in the parent's UTS namespace */

    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in parent: %s\n", uts.nodename);

    if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for child */
        errExit("waitpid");
    printf("child has terminated\n");

    exit(EXIT_SUCCESS);
}

The program creates a UTS namespace by calling clone() with the CLONE_NEWUTS flag. A UTS namespace isolates two system identifiers, the hostname and the NIS domain name, which are set by the system calls sethostname() and setdomainname() respectively and retrieved with the system call uname().

Here’s a look at some of the key parts of the program (we’ll omit error checking for simplicity).

The program takes one command-line argument. When run, it creates a child process that executes in a new UTS namespace, and the child changes the hostname in that namespace to the value given in the argument.

The first key part of the main program creates the child process with the clone() system call:

child_pid = clone(childFunc, 
                  child_stack + STACK_SIZE,   /* Points to start of 
                                                 downwardly growing stack */ 
                  CLONE_NEWUTS | SIGCHLD, argv[1]);

printf("PID of child created by clone() is %ld\n", (long) child_pid);

The child process starts executing in the user-defined function childFunc(), which receives the last argument of clone() (argv[1]) as its own argument. Because the flags include CLONE_NEWUTS, the child process executes in a newly created UTS namespace.

The main process then sleeps for a moment, giving the child process time to change the hostname in its UTS namespace. It then calls uname() to retrieve the hostname in its own UTS namespace and displays it:

sleep(1);           /* Give child time to change its hostname */

uname(&uts);
printf("uts.nodename in parent: %s\n", uts.nodename);

Meanwhile, the function childFunc(), executed by the child process that clone() created, first changes the hostname to the value given in the command-line argument and then retrieves and displays the modified hostname:

sethostname(arg, strlen(arg));

uname(&uts);
printf("uts.nodename in child:  %s\n", uts.nodename);

The child process also sleeps for a while before exiting, so that the new UTS namespace stays open and we have a chance to run further experiments.

Run the program to check whether the parent and child processes are in different UTS namespaces:

$ su                       # Privilege is required to create a UTS namespace
Password:
# uname -n
antero
# ./demo_uts_namespaces bizarro
PID of child created by clone() is 27514
uts.nodename in child:  bizarro
uts.nodename in parent: antero

Creating any namespace other than a user namespace requires privilege, or more specifically the Linux capability CAP_SYS_ADMIN. This prevents set-user-ID (SUID) programs from being fooled into doing the wrong thing because the system has an unexpected hostname. If you’re not familiar with Linux capabilities, check out my previous article: Getting Started with Linux Capabilities.

2. /proc files


Each process has a /proc/PID/ns directory containing one file per namespace type; for example, user represents the user namespace. Starting with kernel version 3.8, each file in this directory is a special symbolic link whose target has the form $namespace:[$namespace-inode-number]. The part before the colon is the namespace type, and the number in brackets is the inode number that identifies the namespace. These links serve as handles for performing certain operations on the namespaces associated with the process.

$ ls -l /proc/$$/ns         # $$ is the PID of the shell
total 0
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 net -> net:[4026531956]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 pid -> pid:[4026531836]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 user -> user:[4026531837]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 uts -> uts:[4026531838]

One use of these symlinks is to check whether two processes are in the same namespace. If the links of two processes show the same inode number, the processes belong to the same namespace; otherwise, they belong to different namespaces. The files these symlinks point to are special and cannot be accessed directly. They live in a file system called nsfs that is not visible to the user. You can call stat() and read the inode number from the st_ino field of the returned structure. From a shell, the stat command (which essentially calls stat()) shows the inode information of the file a link points to:

$ stat -L /proc/$$/ns/net
  File: /proc/3232/ns/net
  Size: 0           Blocks: 0          IO Block: 4096   regular empty file
Device: 4h/4d       Inode: 4026531956  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-01-17 15:45:23.783304900 +0800
Modify: 2020-01-17 15:45:23.783304900 +0800
Change: 2020-01-17 15:45:23.783304900 +0800
 Birth: -

These symbolic links have other uses as well. If we open one of the files, the namespace is not destroyed as long as the file descriptor stays open, even if every process in the namespace terminates. The same effect can be achieved by bind mounting one of the symbolic links elsewhere on the system:

$ touch ~/uts
$ mount --bind /proc/27514/ns/uts ~/uts

3. setns()


A process can join an existing namespace by calling setns(). Its prototype is as follows:

int setns(int fd, int nstype);

To be more precise, setns() separates the calling process from an instance of a particular type of namespace and reassociates that process with another instance of that type of namespace.

  • fd: a file descriptor that refers to the namespace to join. It can be obtained by opening one of the symbolic links, or by opening a file on which one of the links has been bind mounted.
  • nstype: lets the caller check the type of namespace that fd refers to. It can be set to one of the CLONE_NEW* constants mentioned above, in which case the call fails if the namespace is of a different type. A value of 0 disables the check, which is useful when the caller already knows the namespace type or does not care about it.

The combination of setns() and execve() enables a simple but very useful function: add a process to a specific namespace and execute commands in that namespace. Let’s get straight to the example:

/* ns_exec.c

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   Join a namespace and execute a command in the namespace
*/
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

int
main(int argc, char *argv[])
{
    int fd;

    if (argc < 3) {
        fprintf(stderr, "%s /proc/PID/ns/FILE cmd [arg...]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDONLY);   /* Get descriptor for the namespace to join */
    if (fd == -1)
        errExit("open");

    if (setns(fd, 0) == -1)         /* Join the namespace */
        errExit("setns");

    execvp(argv[2], &argv[2]);      /* Execute the command in the joined namespace */
    errExit("execvp");
}

The program requires two or more command-line arguments. The first is the path of a namespace symbolic link (or of a file on which such a link has been bind mounted). The second is the name of the program to execute in that namespace, followed by any command-line arguments that program needs. The key steps are as follows:

fd = open(argv[1], O_RDONLY);   /* Get descriptor for the namespace to join */

setns(fd, 0);                   /* Join the namespace */

execvp(argv[2], &argv[2]);      /* Execute the command in the joined namespace */

Remember that we bind mounted the UTS namespace created by demo_uts_namespaces on ~/uts? Combined with the program in this example, that lets a new process execute a shell in the same UTS namespace:

$ ./ns_exec ~/uts /bin/bash     # ~/uts is bound to /proc/27514/ns/uts
My PID is: 28788

Verify that the new shell is in the same UTS namespace as the child process created by demo_uts_namespaces:

$ hostname
bizarro
$ readlink /proc/27514/ns/uts
uts:[4026532338]
$ readlink /proc/$$/ns/uts      # $$ is the PID of the current shell
uts:[4026532338]

In earlier kernel versions, setns() could not be used to join mount namespaces, PID namespaces, or user namespaces. Starting with kernel 3.8, setns() supports joining all namespace types.

The util-linux package provides the nsenter command, which runs a newly created process in the specified namespaces. Its implementation is simple: the -t argument on the command line names the target process, whose namespace symbolic links are opened; setns() then moves the current process into each requested namespace; finally a child is created (the clone() call at the end of the trace) to run the specified executable. We can use strace to see how it works:

# strace nsenter -t 27242 -i -m -n -p -u /bin/bash
execve("/usr/bin/nsenter", ["nsenter", "-t", "27242", "-i", "-m", "-n", "-p", "-u", "/bin/bash"], [/* 21 vars */]) = 0
.....................
.....................
open("/proc/27242/ns/ipc", O_RDONLY)    = 3
open("/proc/27242/ns/uts", O_RDONLY)    = 4
open("/proc/27242/ns/net", O_RDONLY)    = 5
open("/proc/27242/ns/pid", O_RDONLY)    = 6
open("/proc/27242/ns/mnt", O_RDONLY)    = 7
setns(3, CLONE_NEWIPC)                  = 0
close(3)                                = 0
setns(4, CLONE_NEWUTS)                  = 0
close(4)                                = 0
setns(5, CLONE_NEWNET)                  = 0
close(5)                                = 0
setns(6, CLONE_NEWPID)                  = 0
close(6)                                = 0
setns(7, CLONE_NEWNS)                   = 0
close(7)                                = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4deb1faad0) = 4968

4. unshare()


The final system call to cover is unshare(), which has the following prototype:

int unshare(int flags);

unshare() is similar to clone(), but it operates on the calling process instead of creating a new one: it creates the new namespaces named by the CLONE_NEW* bits in flags and makes the caller a member of them. In effect, the caller leaves its current namespace and joins a freshly created one.

The unshare command is built on the unshare() system call. Its usage is:

$ unshare [options] program [arguments]

The options specify the types of namespaces to create.

The key steps of its implementation are:

/* Initialize 'flags' from the supplied command-line options */

unshare(flags);

/* Now execute 'program' with 'arguments'; 'optind' is the index
   of the next command-line argument after options */

execvp(argv[optind], &argv[optind]);

The full implementation of the unshare command is as follows:

/* unshare.c

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   A simple implementation of the unshare(1) command: unshare
   namespaces and execute a command.
*/
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] program [arg...]\n", pname);
    fprintf(stderr, "Options can be:\n");
    fprintf(stderr, "    -i   unshare IPC namespace\n");
    fprintf(stderr, "    -m   unshare mount namespace\n");
    fprintf(stderr, "    -n   unshare network namespace\n");
    fprintf(stderr, "    -p   unshare PID namespace\n");
    fprintf(stderr, "    -u   unshare UTS namespace\n");
    fprintf(stderr, "    -U   unshare user namespace\n");
    exit(EXIT_FAILURE);
}

int
main(int argc, char *argv[])
{
    int flags, opt;

    flags = 0;

    /* Translate each option into the matching CLONE_NEW* flag */

    while ((opt = getopt(argc, argv, "imnpuU")) != -1) {
        switch (opt) {
        case 'i': flags |= CLONE_NEWIPC;        break;
        case 'm': flags |= CLONE_NEWNS;         break;
        case 'n': flags |= CLONE_NEWNET;        break;
        case 'p': flags |= CLONE_NEWPID;        break;
        case 'u': flags |= CLONE_NEWUTS;        break;
        case 'U': flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

    if (optind >= argc)
        usage(argv[0]);

    if (unshare(flags) == -1)
        errExit("unshare");

    execvp(argv[optind], &argv[optind]);
    errExit("execvp");
}

Next, use the unshare.c program to execute a shell in a new mount namespace:

$ echo $$                              # Show the PID of the current shell
8490
$ cat /proc/8490/mounts | grep mq      # Show one of the mount points in the current namespace
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
$ readlink /proc/8490/ns/mnt           # Show the ID of the current mount namespace
mnt:[4026531840]
$ ./unshare -m /bin/bash               # Start a new shell in a newly created mount namespace
$ readlink /proc/$$/ns/mnt             # Show the ID of the new mount namespace
mnt:[4026532325]

Comparing the output of the two readlink commands, you can see that the two shells are in different mount namespaces. Change a mount point in the new namespace, and then observe whether the mount points of both namespaces change:

$ umount /dev/mqueue
$ cat /proc/$$/mounts | grep mq        # Check that the mount point is gone
$ cat /proc/8490/mounts | grep mq      # Is the mount point still present in the original namespace?
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0

As you can see, the mount point /dev/mqueue in the new namespace has disappeared, but it still exists in the original namespace.

5. Summary


This article took a close look at each component of the namespace API and used them together. Subsequent articles will delve into the individual namespaces, especially the PID namespace and the user namespace.

References

  • Namespaces in operation, part 2: the namespaces API
  • Docker basic technology: Linux Namespace
