Linux namespaces are a kernel-level mechanism for environment isolation. Long ago, Unix gained a system call named chroot, which jails a process in a specific directory by changing its root directory. This provided a simple form of isolation: a process inside the chroot could not access files outside it. Linux namespaces extend this idea, providing isolation for UTS, IPC, mount, PID, network, and user resources.

For example, we all know that the ancestor of all processes on Linux (init) has a PID of 1. So, just as chroot jails a process in a directory subtree, we can jail a process's view of the process space in one branch of the process tree and let the root of that branch see itself as PID 1. This provides resource isolation: processes in different PID namespaces cannot see each other.

In the figure above, we can see that the User namespace was completed in kernel version 3.8, which is why Docker is often said to require CentOS 7 or later (CentOS 7 ships a 3.10 kernel).

Namespaces are implemented mainly through the following three system calls:

  • clone() – creates a new process; the flags passed to it select which new namespaces the process is placed in
  • unshare() – moves the calling process out of a namespace it currently shares and into a new one
  • setns() – adds a process to an existing namespace

unshare() and setns() won’t be covered in detail here; setns() will be covered in a follow-up article (docker exec relies on setns() to enter a running container). Let’s focus on the clone() system call.

The clone() system call

Take a look at the simplest clone() system call example:

#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024)
/* Stack for the child process */
static char container_stack[STACK_SIZE];
char* const container_args[] = {
  "/bin/bash", NULL
};

/* Entry point of the child process */
int container_main(void* arg)
{
  printf("Container - inside the container!\n");
  execv(container_args[0], container_args);
  /* execv() only returns if it failed */
  printf("Something's wrong!\n");
  return 1;
}

int main()
{
  printf("Parent - start a container!\n");
  /* The stack grows downward, so pass the end of the buffer */
  int container_pid = clone(container_main, container_stack+STACK_SIZE, SIGCHLD, NULL);
  waitpid(container_pid, NULL, 0);
  printf("Parent - container stopped!\n");
  return 0;
}

This code is very simple: in main, we use the clone() system call to create a child process that runs container_main, which in turn executes /bin/bash.

However, in the above program the child's process space is no different from the parent's: the child can access everything the parent can.

Let’s take a look at a few examples of what namespaces look like on Linux.

#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];
char* const container_args[] = {
  "/bin/bash", NULL
};
int container_main(void* arg)
{
  printf("Container - inside the container!\n");
  sethostname("test", 4); /* Set hostname; the second argument is the name's length */
  execv(container_args[0], container_args);
  printf("Something's wrong!\n");
  return 1;
}
int main()
{
  printf("Parent - start a container!\n");
  /* CLONE_NEWUTS and CLONE_NEWPID put the child in new UTS and PID namespaces */
  int container_pid = clone(container_main, container_stack+STACK_SIZE, CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
  waitpid(container_pid, NULL, 0);
  printf("Parent - container stopped!\n");
  return 0;
}

In the clone system call above, we added two new flags: CLONE_NEWUTS and CLONE_NEWPID. These create a new UTS namespace and a new PID namespace for the child. Creating these namespaces requires root privileges.

As a result, the hostname seen by the child process is changed to test, and its process ID becomes 1:

$ vi test.c
$ gcc -o test test.c
$ ./test
Parent - start a container!
Container - inside the container!
$ hostname
test
$ echo $$
1

As we know, on traditional UNIX systems the process with PID 1 is init, which has a very special status. As the ancestor of all processes, it has special privileges (for example, it is shielded from signals it has not installed handlers for) and it watches over the state of other processes. If a child is orphaned (its parent exited without waiting for it), init adopts it and takes care of reclaiming its resources. Therefore, to isolate the process space, we first need a process with PID 1 and, much as with chroot, the child process should see itself as PID 1 inside the container.

However, if you run ps, top, or similar commands in the child's shell, you can still see all the processes on the host, which shows that the isolation is incomplete. This is because commands like ps and top read the /proc file system, and since /proc is the same for both parent and child, these commands display the same thing.

So, we also need to isolate the file system.

Mount namespace isolation

In the following program, we enable a mount namespace and remount the /proc file system in the child process:

#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];
char* const container_args[] = {
  "/bin/bash", NULL
};

int container_main(void* arg)
{
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container", 9);
    /* On hosts where / is a shared mount (common with systemd), make all
       mounts private first so the remount below does not reach the host */
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
    /* Remount the proc filesystem to /proc */
    system("mount -t proc proc /proc");
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main()
{
    printf("Parent [%5d] - start a container!\n", getpid());
    /* Enable Mount Namespace - add CLONE_NEWNS parameter */
    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

The running results are as follows:

$ ./mount
Parent [16436] - start a container!
Container [    1] - inside the container!
$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 11:25 pts/0    00:00:00 /bin/bash
root        10     1  0 11:25 pts/0    00:00:00 ps -ef

Above, we can see that there are only two processes, and the process with pid=1 is our /bin/bash. We can also see that the /proc directory is much cleaner:

ls /proc
1     buddyinfo  consoles  diskstats    fb           iomem     kcore      kpagecgroup  locks    modules  pagetypeinfo  schedstat  softirqs  sysrq-trigger  tty                vmallocinfo
15    bus        cpuinfo   dma          filesystems  ioports   keys       kpagecount   mdstat   mounts   partitions    scsi       stat      sysvipc        uptime             vmstat
23    cgroups    crypto    driver       fs           irq       key-users  kpageflags   meminfo  mtrr     pressure      self       swaps     thread-self    version            zoneinfo
acpi  cmdline    devices   execdomains  interrupts   kallsyms  kmsg       loadavg      misc     net      sched_debug   slabinfo   sys       timer_list     version_signature

When a mount namespace is created with CLONE_NEWNS, the child receives a copy of the parent's mount table. Mount operations performed in the child's new namespace then affect only its own view of the file system and have no effect on the outside world (provided mount propagation is private), so strict isolation can be achieved.

In fact, what users want is for each new container to see its file system as a separate, isolated environment rather than one inherited from the host. To get this, the container's entire root directory “/” can be remounted before the container process starts. Thanks to the mount namespace, this mount is invisible to the host, so the container process can do whatever it likes with it.

On Linux, there is a command called chroot that can help you do this easily in your shell. It helps you “change root file system”, that is, change the root directory of the process to the location you specify. It’s also very simple to use.

Suppose we now have a $HOME/test directory that we want to use as the root of a /bin/bash process.

First, create a test directory and several lib folders:

$ mkdir -p $HOME/test/{bin,lib64,lib}

Then, copy the bash and ls commands to the bin directory under the test directory:

$ cp -v /bin/{bash,ls} $HOME/test/bin

Next, copy all the shared libraries (.so files) that these commands need, which ldd /bin/bash and ldd /bin/ls will list, to the matching lib paths under the test directory.

Finally, run the chroot command to tell the operating system that we will use the $HOME/test directory as the root of the /bin/bash process:

$ chroot $HOME/test /bin/bash

At this point, if you execute “ls /”, you will see that it returns the contents under the $HOME/test directory, not the contents of the host.

More importantly, the chrooted process is unaware that its root directory has been changed to $HOME/test.

In fact, the mount namespace grew out of successive improvements to chroot, and it was the first namespace added to Linux.

Of course, to make the container’s root look “real,” it is common to mount a full operating system file system, such as the contents of the Ubuntu 20.04 ISO, under the container’s root. This way, once the container is started, running “ls /” inside it shows the directories and files of Ubuntu 20.04.