Linux introduction

UNIX is an interactive system for simultaneous processing of multiple processes and simultaneous online users. The reason for UNIX is that Linux evolved from UNIX, and UNIX was designed by programmers and primarily served by programmers. Linux inherits the design goals of UNIX. The Linux operating system is everywhere from smartphones to cars, supercomputers and home appliances, from home desktops to enterprise servers.

Most programmers like to keep their systems as simple, elegant, and consistent as possible. For example, at the lowest level, a file should be just a collection of bytes. In order to achieve sequential access, random access, keystroke access, remote access can only hinder your work. Same as if the command

ls A*
Copy the code

Means that only all files beginning with A are listed, then the command

rm A*
Copy the code

All files starting with A should be removed instead of just files with A*. This feature is also the Principle of Least Surprise.

The principle of least surprise is commonly used in user interface and software design. The prototype is that the function or feature should conform to the user’s expectations and should not surprise or shock the user.

Some experienced programmers usually want systems that are more functional and flexible. One of the basic goals of Linux design is for each application to do one thing and do it well. So the compiler just compiles. The compiler doesn’t produce lists, because there are other applications that do it better than the compiler.

Many people don’t like redundancy. Why use copy when CP can describe what you want to do? It’s just a waste of precious hacking time. To extract all lines from the file that contain the string ARD, a Linux programmer should type

grep ard f
Copy the code

Linux interface

The Linux system is a pyramid model system, as shown below

The application makes a system call to place the parameters in the register (and sometimes the stack) and issues a trap system fall instruction to switch from user mode to kernel mode. Because you cannot write trap instructions directly in C, C provides a library of functions that correspond to system calls. Some functions are written in assembly but can be called from C. Each function first puts its arguments in place and then executes the system call instructions. So if you want to perform the read system call, the C program will call the read library to do so. By the way, this is the library interface specified by POSIX, not the system call interface. That is, POSIX tells a standard system which library procedures should be provided, what their arguments are, what they must do, and what results they must return.

In addition to the operating system and system call libraries, the Linux operating system provides standard programs such as text editors, compilers, file manipulation tools, and so on. It is these applications that directly deal with the user. So we can say that Linux has three different interfaces: the system call interface, the library function interface, and the application program interface

The Graphical User Interface (GUI) on Linux is very similar to UNIX in that it creates a desktop environment that includes Windows, targets and folders, toolbars, and drag-and-drop capabilities for files. A complete GUI also includes a window manager as well as various applications.

The GUI on Linux is supported by the X window, whose main components are the X server, control keyboard, mouse, monitor, etc. When using a graphical interface on Linux, users can run programs or open files with a mouse click, copy files by dragging and dropping, and so on.

Linux components

In fact, the Linux operating system can be made up of the following parts

  • BootloaderBootloader is the software that manages the computer startup process. To most users, it’s just a pop up screen, but the internal operating system does a lot of things
  • The Kernel (Kernel)The kernel is the core of the operating system and is responsible for managing CPU, memory, peripheral devices, etc.
  • Init System: this is the subsystem that guides user space and is responsible for controlling the daemon. Once the initial boot has been handed over from the boot loader, it is the initialization system used to manage the boot process.
  • Background process (Daemon)Background processes are programs that run in the background, such as print, sound, scheduling, etc. They can be started during boot or after logging in to the desktop
  • Graphical Server: This is the subsystem that displays graphics on the monitor. This is usually referred to as the X server or X.
  • Desktop EnvironmentThis is where the user actually interacts with it. There are many desktop environments to choose from, each containing built-in applications such as file managers, Web browsers, games, and so on
  • ApplicationsThe desktop environment does not offer complete applications. Just like Windows and macOS, Linux offers thousands of high-quality software that can be easily found and installed.

Shell

Although Linux applications provide a GUI, most programmers prefer to use a command-line interface, called a shell. The user usually starts a shell window in the GUI and then works under the shell window.

Shell command lines are faster, more powerful, easy to expand, and do not cause repetitive strain (RSI).

Here are some of the simplest Bash shells. When the shell starts, it first initializes, prints a prompt on the screen, usually a percent sign or dollar sign, and waits for user input

After the user enters a command, the shell extracts the first word, which is a string of characters separated by Spaces or tabs. Assuming that the word is the name of the program to run, the program is searched, and if it is found, it is run. The shell then suspends itself until the program is finished, and then tries to read the next instruction. Shell is also a common user program. Its main function is to read the user’s input and display the computed output. Shell commands can contain arguments, which are passed as strings to the program being called. Such as

cp src dest
Copy the code

The cp application is called with two arguments, SRC and dest. The program will interpret that the first argument is an existing file name, and then create a copy of the file named dest.

Not all arguments are filenames, as shown below

head -20 file
Copy the code

The first argument, -20, tells the HEAD application to print the first 20 lines of the file instead of the default 10. The parameters that control the operation of the command or specify optional values are called flags, and by convention flags should be represented by -. This notation is necessary, for example

head 20 file
Copy the code

Is a perfectly legal command that tells the head program to output the first 10 lines of a file named 20, and then the first 10 lines of a file named file. The Linux operating system can accept one or more parameters.

To make it easier to specify multiple file names, the shell supports magic characters, also known as wild cards. For example, * can match one or more possible strings

ls *.c
Copy the code

Tell ls to list all files whose names end in.c. If multiple files exist, they will be juxtaposed later.

Another wildcard is the question mark, which matches any character. A set of characters in brackets can represent any one of them, so

ls [abc]*
Copy the code

All files starting with a, B, or C are listed.

Shell applications do not necessarily input and output through a terminal. When the shell starts, it gets the ability to access standard input, standard output, and standard error files.

Standard output is input from the keyboard, and standard output or standard error is output to the monitor. By default, many Linux programs input from standard input and output from standard output. Such as

sort	
Copy the code

Sort is called, reads the data from the terminal (until the user enters CtrL-D), sorts it alphabetically, and prints the results to the screen.

It is also common to redirect standard input and standard output, redirecting standard input with < followed by a file name. The standard output can be redirected with a greater than sign >. Allows redirection of standard input and output from a command. For example, the command

sort <in >out
Copy the code

Sort takes the input from the in file and outputs the result to the out file. Since standard error is not redirected, the error message is printed directly to the screen. A program that reads from standard input, processes it, and writes it to standard output is called a filter.

Consider the following command, which consists of three separate commands

sort <in >temp; head -30 <temp; rm tempCopy the code

The sort application is first called to read from the standard input in and through the standard output to Temp. When the program is finished, the shell runs head, telling it to print the first 30 lines, and prints them on standard output (which defaults to terminal). Finally, the temp temp file is deleted. Gently, you go, you waved a sleeve, do not take a cloud.

The first program on the command line usually produces output, which in the above example is not received by the temp file. However, Linux also provides a simple command to do this, such as the following

sort <in | head -30
Copy the code

| above is called the vertical bar symbol, its meaning from the sort of application, according to the output directly as input without having to create, use and remove temporary files. A collection of commands connected by a pipe symbol is called a pipeline. For example, the following

grep cxuan *.c | sort | head -30 | tail -5 >f00
Copy the code

Lines that contain cXUAN in any.t file are written to standard output and sorted. The first 30 lines of this content are taken out by head and passed to Tail, which in turn passes the last five lines to Foo. This example provides a pipe to connect multiple commands.

You can put a series of shell commands into a file and then run that file as input. The shell processes them in sequence, just like typing a command on a keyboard. Files that contain shell commands are called shell scripts.

A recommended site for learning shell commands is www.shellscript.sh/

A shell script is actually a program. A shell script can assign values to variables. It also contains loop control statements such as if, for, while, etc. The shell is designed to look like C (There is no doubt that C is father). Since the shell is also a user program, the user can choose between different shells.

Linux applications

The command line of Linux is the shell, which consists of a large number of standard applications. There are six main types of applications

  • File and directory operation commands
  • The filter
  • Text program
  • System management
  • Program development tools, such as editors and compilers
  • other

In addition to these standard applications, there are other applications such as Web browsers, multimedia players, picture browsers, office software, and game programs.

We’ve seen several Linux applications in the examples above, such as Sort, CP, LS, and HEAD, so let’s take a look at other Linux applications.

Let’s start with a couple of examples, like

cp a b
Copy the code

Is to make a copy of A as b, and

mv a b
Copy the code

Move a to B, but delete the original file.

There are some differences between the above two commands. Cp is to copy the file. After the copy is complete, there will be two files a and b. Mv is just a move of a file, and when it’s done, there’s no more file A. The cat command can connect the contents of multiple files. Use RM to delete files. Using chmod allows the owner to change access permissions; You can use the mkdir and rmdir commands to create and delete files and directories. You can use ls to view directory files. Ls displays many properties, such as size, user, and creation date. Sort determines the order in which files are displayed

Linux applications also include the filter grep, which extracts a particular pattern of lines from standard input or from one or more input files; Sort sorts the input and outputs it to standard output; Head extracts the first few lines of input; Tail extracts the last few lines of input; Other filters include Cut and paste, which allow you to cut and copy lines of text; Od converts the input to ASCII; Tr to achieve character case conversion; Pr for formatting print output, etc.

The program compiler tool uses GCC;

The make command is used for automatic compilation. This is a powerful command used to maintain a large program whose source code consists of many files. Typically, there are header files, which are usually included in the source file using the include directive. What make does is keep track of which files belong to the header file and then schedule the automatic compilation process.

The standard POSIX applications are listed below

The program application
ls List the directory
cp Copy the file
head Displays the first few lines of the file
make Compile files to generate binary files
cd Switch directory
mkdir Create a directory
chmod Example Modify file access permission
ps Listing file processes
pr Formatted print
rm Deleting a file
rmdir Deleting a file directory
tail Extract the last few lines of the file
tr Character set conversion
grep grouping
cat Stdout multiple files consecutively
od Displays files in octal
cut Cut from the file
paste Paste it from the file

Linux Kernel Structure

Having seen the overall structure of Linux above, let’s take a look at the Linux kernel structure as a whole

The kernel sits directly on the hardware, and its main functions are I/O interaction, memory management, and control of CPU access. The figure above also includes interrupts and schedulers. Interrupts are the primary way to interact with the device. The scheduler comes into play when an interrupt occurs. The low-level code here stops the running process, saves its state in the kernel process structure, and starts the driver. Process scheduling also occurs when the kernel completes some operations and starts user processes. The dispatcher in the diagram is the Dispatcher.

Note that the dispatcher here is not the scheduler. There is a difference between the two

Scheduler and Dispatcher are both concepts related to process scheduling. The difference is that scheduler randomly selects one process from several processes. The Dispatcher assigns CPUS to processes selected by the Scheduler.

Then, we divided the kernel system into three parts.

  • The I/O section is responsible for all the parts of the kernel that interact with devices and perform network and storage I/O operations.

The diagram shows the RELATIONSHIP between the I/O hierarchy. At the top level is a virtual file system. That is, whether files come from memory or disk, they pass through the virtual file system. At the bottom level, all drivers are either character or block device drivers. The main difference is whether random access is allowed. The network driver device is not a separate driver device, it is actually a character device, but the network device is handled differently from the character device.

In the above device driver, the kernel code is different for each device type. Character devices can be used in two ways, with one-key ones such as VI or Emacs that require each keyboard input. Others, such as the shell, need to enter a line and press enter to send the string to the program for editing.

Network software is often modular, supported by different devices and protocols. Most Linux systems include the functionality of a complete hardware router in the kernel, but this cannot be compared to external routers. On top of the router is the protocol stack, including TCP/IP. On top of the protocol stack is the socket interface, which is responsible for external communication and acts as a gate.

Above the disk drive is the I/O scheduler, which sorts and allocates disk reads and writes to minimize unwanted head movement.

  • To the right of THE I/O is the memory component, the program is loaded into memory, executed by the CPU, and there are virtual memory components involved, how the page swapping in and out works, the replacement of bad pages and the caching of frequently used pages.

  • The process module is responsible for the creation and termination of processes, and the scheduling of processes. Linux treats processes and threads as runnable entities and schedules them using a uniform scheduling policy.

At the very top of the kernel is the system call interface, through which all system calls pass. The system call triggers a trap that converts the system from user mode to kernel mode, and then transfers control to the above kernel component.

Linux processes and threads

Let’s take a closer look at the Linux kernel to understand the basic Linux concepts of processes and threads. System calls are interfaces to the operating system itself that are important for creating processes and threads, allocating memory, sharing files, and I/O.

We will discuss the common features of each version.

The basic concept

Each process runs a separate program and has a separate thread of control when it is initialized. In other words, each process has its own program counter, which is used to record an instruction that needs to be executed. Linux allows processes to create additional threads at run time.

Linux is a multiprogramming system, so there are independent processes running at the same time. In addition, each user has several processes active simultaneously. If you have a large system, you might have hundreds or thousands of processes running simultaneously.

In some user Spaces, background processes, called daemons, are still running even when the user is logged out.

There is a special kind of daemon in Linux called the Cron daemon. The Cron daemon wakes up every minute to check if there is work to be done, and then goes back to sleep to be woken up again.

Cron is a daemon that can do anything you want, such as regular maintenance, regular backups, etc. There are similar programs on other operating systems, such as the Cron daemon on Mac OS X called the launchd daemon. This can be called a Task Scheduler on Windows.

On Linux systems, processes are created in a very simple way, with the fork system call creating a copy (copy) of the source process. The process that calls the fork function is called the parent process, and the process created by the fork function is called the child process. Both the parent and child processes have their own memory images. If, after the child is created, the parent changes variables, etc., then the child will not see the changes, that is, after fork, the parent and child are independent of each other.

Although the parent and child processes remain independent of each other, they can share the same file. If the parent opened a file before the fork, the parent and child processes will still share the open file after the fork. Changes to shared files are visible to both parent and child processes.

So how do you distinguish between parent and child processes? The child process is just a copy of the parent process, so they are almost all the same, including memory images, variables, registers, and so on. The key is the return value of fork. If fork returns a non-zero value, the non-zero value is the Process Identiier (PID) of the child Process, and the zero value is returned to the child Process. This can be represented by the following code

pid = fork();    // Call the fork function to create the process
if(pid < 0){
  error()				 // Pid < 0, failed to create
}
else if(pid > 0){
  parent_handle() // Parent process code
}
else {
  child_handle()  // Child process code
}
Copy the code

After fork, the parent process will get the PID of the child process. This PID can represent the unique identifier of the child process. If the child process wants to know its own PID, it can call the getPID method. When the child finishes running, the parent gets the child’s PID, which is important because a process forks many children and children fork children. We call the process after the first call to fork the original process, a original process can generate an inheritance tree

Linux interprocess communication

The communication mechanism between Linux processes is usually called internel-process communication,IPC. In general, the communication mechanism between Linux processes can be divided into six types

Below we outline each of them

Signal signal

Signals are the first interprocess communication mechanism used in UNIX systems. Since Linux inherits from UNIX, Linux also supports signaling by sending asynchronous event signals to one or more processes. Signals can be generated from a keyboard or access to a non-existent location. The signal sends the task to the child process through the shell.

You can type kill -l on A Linux system to list the signals used by the system. Here are some of the signals I provided

Processes can choose to ignore incoming signals, but two cannot be ignored: SIGSTOP and SIGKILL signals. The SIGSTOP signal tells the current running process to perform the shutdown operation, and the SIGKILL signal tells the current process that it should be killed. In addition, the process can choose which signal it wants to process, the process can choose to block the signal, and if not, it can choose to process the signal itself, or it can choose to do the kernel processing. If you choose to delegate processing to the kernel, default processing is performed.

The operating system interrupts the process of the target program to send a signal to it. In any non-atomic instruction, execution can be interrupted. If the process has registered a new number handler, the process executes; if not, the default processing is used.

For example, when a process receives a SIGFPE floating point exception, the default action is to dump and exit. Signals have no priority. If two signals are generated for a process at the same time, they can be presented to the process or processed in any order.

So let’s see what these signals are for

  • SIGABRT and SIGIOT

The SIGABRT and SIGIOT signals are sent to the process to tell it to terminate, usually started by the process itself when the C library abort() function is called

  • SIGALRM, SIGVTALRM, and SIGPROF

SIGALRM, SIGVTALRM, and SIGPROF are sent to the process when the set clock function times out. SIGALRM is sent when the actual or clock time times out. SIGVTALRM is sent when the CPU time used by the process times out. SIGPROF is sent when the CPU time used by the process and the system on behalf of the process times out.

  • SIGBUS

SIGBUS is sent to the process when a bus interrupt error is caused

  • SIGCHLD

When the child process terminates, is interrupted, or is recovered from interruption, the SIGCHLD is sent to the process. A common use of this signal is to instruct the operating system to clear resources used by a child process after it terminates.

  • SIGCONT

The SIGCONT signal instructs the operating system to continue the process that was previously suspended by the SIGSTOP or SIGTSTP signal. An important use for this signal is in job control in a Unix shell.

  • SIGFPE

The SIGFPE signal is sent to the process when an incorrect arithmetic operation (such as dividing by zero) is performed.

  • SIGUP

When a SIGUP controlled terminal is closed, it is sent to the process. Many daemons will reload their configuration files and reopen their log files rather than exit upon receiving this signal.

  • SIGILL

The SIGILL signal is emitted when an illegal, ill-formed, unknown, or privileged instruction is attempted

  • SIGINT

When a user wishes to interrupt a process, the operating system sends a SIGINT signal to the process. By typing Ctrl-C, the user wants to interrupt the process.

  • SIGKILL

The SIGKILL signal is sent to the process to terminate immediately. In contrast to SIGTERM and SIGINT, this signal cannot be captured and ignored for execution, and the process cannot perform any cleanup operations after receiving this signal, with some exceptions

A zombie process cannot be killed because it is already dead and waiting for its parent process to capture it

A blocked process can only be killed if it is wakened again

The init process is the Linux initialization process, which ignores any signals.

SIGKILL is usually sent to the process as a signal to kill the process last. It is usually sent to the process if there is no response to SIGTERM.

  • SIGPIPE

SIGPIPE is sent to the process when it attempts to write to the process pipeline and finds that the pipeline is not connected

  • SIGPOLL

A SIGPOLL signal is sent when an event occurs on an explicitly monitored file descriptor.

  • SIGRTMIN to SIGRTMAX

SIGRTMIN to SIGRTMAX is a real-time signal

  • SIGQUIT

When a user requests to exit the process and perform a core dump, the SIGQUIT signal is sent to the process from its control terminal.

  • SIGSEGV

The SIGSEGV signal is sent to the process when it makes an invalid virtual memory reference or a segmentation error, that is, when a segmentation violation is executed.

  • SIGSTOP

SIGSTOP indicates when the operating system is terminated for later recovery

  • SIGSYS

The SIGSYS signal is sent to the process when the error parameter is passed to the system call.

  • SYSTERM

We briefly mentioned the term SYSTERM, which is a signal sent to the process to request termination. Unlike the SIGKILL signal, this signal can be captured or ignored by the process. This allows the process to perform a good termination, freeing up resources and saving state when appropriate. SIGINT is almost the same as SIGTERM.

  • SIGTSIP

The SIGTSTP signal is sent to the process by its control terminal to request the terminal to stop.

  • SIGTTIN and SIGTTOU

The SIGTTIN and SIGTTOU signals are sent to the process when the SIGTTIN and SIGTTOU signals, respectively, attempt to read or write from the TTY in the background.

  • SIGTRAP

The SIGTRAP signal is sent to the process in case of an exception or trap

  • SIGURG

The SIGURG signal is sent to the process when the socket has readable emergency or out-of-band data.

  • SIGUSR1 and SIGUSR2

The SIGUSR1 and SIGUSR2 signals are sent to the process to indicate user-defined conditions.

  • SIGXCPU

The SIGXCPU signal is sent to the process when it deplets the CPU for longer than a predetermined user-configurable value

  • SIGXFSZ

When the SIGXFSZ signal grows beyond the maximum allowable size of the file, the signal is sent to the process.

  • SIGWINCH

The SIGWINCH signal is sent to the process when its control terminal changes its size (window changes).

Pipeline pipe

Processes in A Linux system can communicate by creating pipes.

A channel can be created between two processes, one to which the byte stream is written, and the other to which the byte stream is read. Pipes are synchronous, and when a process attempts to read data from an empty pipe, the process is blocked until data is available. Pipelines in the shell are implemented with pipelines, when the shell finds output

sort <f | head
Copy the code

It creates two processes, sort and head sort, and it creates a pipe between the two applications so that the standard output of sort is the standard input of head. The output produced by sort is not written to the file, and if the pipe is full the system stops sort to wait for head to read the data

Pipeline is actually |, two applications don’t know the existence of pipeline, everything is by the management and control of the shell.

Shared memory Shared memory

Interprocess communication between two processes can also be conducted through shared memory, where two or more processes have access to a common memory space. Shared work between the two processes is done through shared memory, and changes made by one process can be visible to the other (much like communication between threads).

Before using shared memory, you need to go through a series of calls, which are as follows

  • Create a shared memory segment or use an existing shared memory segment(shmget())
  • Append the process to an already created memory segment(shmat())
  • Detaches the process from the connected shared memory segment(shmdt())
  • Perform control operations on the shared memory segment(shmctl())

First-in, first-out queue FIFO

First-in, first-out (FIFO) queue FIFOs are often referred to as Named Pipes, and Named Pipes work much like regular Pipes, although they do have some notable differences. Unnamed pipes have no backup files: the operating system maintains buffers in memory that are used to transfer bytes from writers to readers. Once the write or output is terminated, the buffer is reclaimed and the transmitted data is lost. In contrast, named pipes have supporting files and unique apis, and named pipes exist as device-specific files in the file system. When all process communication is complete, the named pipe is kept in the file system for later use. Named pipes have strict FIFO behavior

The first byte written is the first byte read, the second byte written is the second byte read, and so on.

Message Queue

You may not know what the term message queue means when you hear it, but a message queue is used to describe an internal list of links in the kernel’s addressing space. Messages can be sent sequentially to and retrieved from a queue in several different ways. Each message queue is uniquely identified by an IPC identifier. There are two modes of message queuing, one is strict mode, strict mode is like FIFO first in first out queue, messages are sent sequentially, messages are read sequentially. Then there is the non-strict pattern, where the ordering of messages is not very important.

The Socket Socket

Another way to manage communication between two processes is to use sockets, which provide end-to-end dual-phase communication. A socket can be associated with one or more processes. Just like pipes have command pipes and unnamed pipes, sockets come in two modes. Sockets are generally used for network communication between two processes. Network sockets require support from a base protocol such as TCP (Transmission Control Protocol) or, at a lower level, UDP (User Datagram Protocol).

There are several types of sockets

  • Sequential Packet Socket: This socket provides a reliable connection for datagrams of a fixed maximum length. This connection is bidirectional and sequential.
  • Datagram Socket: Packet socket supports bidirectional data flow. The packet socket may receive messages in a different order than the sender.
  • Stream SocketA: stream socket works like a telephone conversation, providing a reliable two-way stream of data.
  • Raw Socket: Primitive sockets can be used to access the underlying communication protocol.

Processes manage system calls in Linux

Now focus on the system calls related to process management on a Linux system. Before you know that, you need to know what a system call is.

The main function of the operating system is to provide the user with an abstraction, to hide the internal implementation, and to let the user only care about how to use it under the GUI graphical interface. The operating system can be divided into two modes

  • Kernel mode: Mode used by the operating system kernel
  • User mode: The mode used by the user application

Context switching refers to frequent switching between kernel-mode and user-mode modes. A system call, which usually runs silently in the background, indicates that a computer program requests services from its operating system kernel.

There are many system call instructions, and the following are some of the most important system calls related to process management

fork

The fork call is used to create a child process that is identical to the parent process. The child process has the same program counters, CPU registers, and open files as the parent process.

exec

The exec system call is used to execute files that reside in the active process, and after exec is called, the new executable replaces the previous executable and gets executed. That is, when exec is called, the old file or program is replaced with a new file or execution, and then the file or program is executed. The new executable is loaded into the same execution space, so the PID of the process does not change, because we are not creating a new process, we are just replacing the old process. But the process’s data, code, and stack have all been modified. If the process being replaced contains more than one thread, all threads are terminated and the new process image is loaded for execution.

The concept of a Process image needs to be explained here

What is a process image? A process image is an executable file that you need to execute your program. It usually includes the following things

  • Codesegment/Textsegment

Also known as text segment, a block of memory used to hold instructions to run code

This space size is determined before the code runs

Memory Spaces are generally read-only, and code in some schemas is allowed to be writable

In the code snippet, it is also possible to include read-only constants such as string constants.

  • Datasegment

Can read but write

Stores initialized global variables and initialized static variables

The lifetime of the data in the data segment is persistence with the program (persistence with the process) persistence with the process: it exists as soon as the process is created and disappears as soon as the process dies

  • BSS segment:

Can read but write

Stores uninitialized global variables and uninitialized static variables

Data in the BSS segment generally defaults to 0

  • The Data segment

Is read and write because the value of the variable can be changed at run time. The size of this segment is also fixed.

  • Stack:

Can read but write

Stores local (non-static) variables in a function or code

The life of the stack continues with the block of code, giving you space when the block runs, and automatically reclaims space when the block ends

  • Heap:

Can read but write

The storage is the malloc/ realLOc space allocated dynamically during the program run

The lifetime of the heap continues with the process, from malloc/realloc to free

The following is a diagram of the composition of these regions

An exec system call is a collection of functions that are

  • execl
  • execle
  • execlp
  • execv
  • execve
  • execvp

Here’s how exec works

  1. The current process image is replaced with the new process image
  2. The new process image is the one you pass as exec
  3. End the running process
  4. The new process image has a PID, the same environment, and some file descriptors (because the process is not replaced, just the process image is replaced)
  5. The CPU status and virtual memory are affected. The virtual memory mapping of the current process image is replaced by the virtual memory mapping of the new process image.

waitpid

Wait for the child process to finish or terminate

exit

On many computer operating systems, the termination of a computer process is performed by executing the exit system call command. 0 indicates that the process can be terminated normally, and other values indicate that the process is terminated abnormally.

Some other common system calls are as follows

System call instruction describe
pause Hang up the signal
nice Change the priority of time-sharing processes
ptrace Process tracking
kill Send a signal to the process
pipe Create the pipe
mkfifo Create a special file for fifO (named pipe)
sigaction Sets the handling method for the specified signal
msgctl Message controlled operation
semctl Semaphore control

Implementation of Linux processes and threads

Linux process

In the Linux kernel structure, processes are represented as tasks and created using the structure. Unlike other operating systems that distinguish between processes, lightweight processes, and threads, Linux uses a unified task structure to represent the execution context. Thus, for each single-threaded process, a single-threaded process will be represented by a task structure, and for multithreaded processes, a task structure will be assigned to each user-level thread. The Linux kernel is multi-threaded, and kernel-level threads are not associated with any user-level threads.

For each process, there is a task_struct process descriptor corresponding to it in memory. The process descriptor contains all the useful information about the kernel management process, including scheduling parameters, open file descriptors, and so on. The process descriptor has been in the kernel stack since the process was created.

Linux, like Unix, uses PID to distinguish different processes, and the kernel will compose the task structure of all processes into a two-way linked list. The PID can be mapped directly to the address of the task structure called the process, thus direct access without traversing the two-way linked list.

We mentioned the process descriptor, which is a very important concept, we also mentioned above process descriptor is located in memory, we omit the sentence here, and that is the process descriptor is the user’s task structure, when the process is located in the memory and start running, the process description FuCai will be transferred to memory.

A Process In Memory is called a Process In Memory (PIM), which is a representation of von Neumann’s architecture. A program that is loaded into Memory and executed is called a Process. Simply put, a process is a program that is executing.

Process descriptors fall into the following categories

  • Scheduling Parameters: The process priority, the last CPU consumption, and the last sleep time together determine the next process to run
  • Memory imageAs we said above, a process image is an executable file that is required to execute a program. It consists of data and code.
  • Signal (signals): Shows which signals are captured and which signals are executed
  • register: The contents of registers are saved in case of a kernel trap.
  • System Call State: Information about the current system call, including parameters and results
  • File Descriptor table: I-Node data structure that locates related files in the file descriptor table with file descriptors as indexes when the system related to file descriptors is called
  • Accounting StatisticsSome operating systems also store the maximum CPU time used by a process, the maximum stack space a process has, and the number of pages a process can consume.
  • Kernel Stack: a fixed stack that can be used by the kernel part of a process
  • other: Current process status, event waiting time, timeout time from the alert, PID, parent process PID, user identifier, etc

With this information, it is now easy to describe how these processes are created in Linux, and creating new processes is actually quite simple. Create a new user-space process descriptor for the child process and then copy a large amount of content from the parent process. Assign the child a PID, set its memory map, give it access to the parent’s files, register and start.

When the fork system call is executed, the calling process plunges into the kernel and creates data structures related to the task, such as the kernel stack and thread_info structures.

Refer to the thread_info structure

Docs.huihoo.com/doxygen/lin…

This structure contains process descriptors, which are located in a fixed location, allowing a Linux system to locate the data structure of a running process with little overhead.

The main content of the process descriptor is populated according to the parent process descriptor. The Linux operating system looks for an available PID that is not being used by any process, and updates the process identifier to point to a new data structure. To reduce hash table collisions, process descriptors form linked lists. It also sets the fields of the task_struct to point to the corresponding previous/next process on the task array.

Task_struct: a Linux process descriptor that involves a lot of C++ source code, which we’ll cover later.

In principle, memory area is opened up for the child process and data segment and stack segment are allocated for the child process, and the contents of the parent process are copied, but in fact, after fork is completed, the child process and the parent process do not share memory, so replication technology is needed to achieve synchronization, but the replication cost is relatively large. So the Linux operating system uses a spoofing method. The child process is assigned a page table, and the newly assigned page table points to the parent process’s pages, which are read-only. When the process writes to these pages, the protection error is enabled. When the kernel detects a write operation, it allocates a copy for the process to copy data to. This copy is shared. This method is called copy on write, which avoids the need to maintain two copies in the same memory area and saves memory space.

After the child process starts running, the operating system calls the exec system call, and the kernel does a lookup to verify the executable, copies the parameters and environment variables into the kernel, and frees up the old address space.

Now the new address space needs to be created and populated. If the system supports mapped files, as it does on Unix systems, a new page table is created to indicate that there are no pages in memory, unless the pages being used are stack pages whose address space is supported by an executable on disk. When a new process starts running, it immediately receives a Page fault, which causes pages with code to load into memory. Finally, the parameters and environment variables are copied to the new stack, the signal is reset, and the registers are all cleared to zero. The new command starts to run.

Here’s an example: the user prints ls, the shell calls fork to copy a new process, and the shell process calls exec to overwrite its memory with the contents of the executable ls.

Linux threads

Now let’s talk about threads in Linux. Threads are lightweight processes, which you’ve heard many times before, lightweight because all process switches require clearing all tables, shared information between processes is cumbersome, usually through pipes or shared memory. If the parent and child processes are forked, the shared file is used, but thread switching does not need to be as expensive as a process, and it is easier for threads to communicate. There are two kinds of threads: user-level threads and kernel-level threads

User-level thread

User-level threads avoid using the kernel. Typically, each thread will display a call switch, send a signal, or perform some kind of switch operation to abandon the CPU. Again, timers can force the switch. The problem with implementing threads at the user level is that a single thread can monopolize the CPU time slice, starving other threads to death. If an I/O operation is performed, I/O blocks and no other thread can run.

One solution is that some user-level thread packages solve this problem. You can use a monitor of clock cycles to control first time slice exclusivity. Some libraries then solve the I/O blocking problem for system calls through special wrappings, or they can write tasks for non-blocking I/O.

Kernel level thread

Kernel-level threads are typically implemented in the kernel using several process tables, one for each task. In this case, the kernel schedules each thread within the time slice of each process.

All calls that can block are made by means of system calls, and when a thread blocks, the kernel can choose whether to run another thread in the same process (if there are any threads ready) or a thread in another process.

Going from user space to kernel space to user space is expensive, but the time cost of thread initialization is negligible. The benefit of this implementation is that the clock determines the thread switching time, so it is less likely to bind the time slice to the other thread occupancy time in the task. Similarly, I/O blocking is not a problem.

Hybrid implementation

Combining the advantages of user-space and kernel space, the designers took a kernel-level threading approach and then multiplexed the user-level threads with some or all of the kernel threads

In this model, the programmer is free to control the number of user threads and kernel threads, with great flexibility. With this approach, the kernel only recognizes kernel-level threads and schedules them. Some of these kernel-level threads are multiplexed by multiple user-level threads.

Linux scheduling

To look at the scheduling algorithm for A Linux system, first recognize that the threads of a Linux system are kernel threads, so the Linux system is thread-based, not process-based.

For scheduling purposes, Linux systems divide threads into three categories

  • Real-time first in, first out
  • Real-time polling
  • time-sharing

The real-time FIRST-in, first-out thread has the highest priority and will not be preempted by another thread unless it is just ready for a higher-priority thread to enter. Real-time rotation threads are basically the same as real-time first-in, first-out threads, except that each real-time rotation thread has a certain amount of time before it can be preempted. If multiple real-time threads are ready, then each thread runs for the specified amount of time before being inserted at the end of the real-time rotation thread.

Note that this real-time is only relative, not absolute real-time, because the running time of the thread cannot be determined. Compared with time-sharing system, they are more real-time

Linux assigns a nice value to each thread, which represents the concept of priority. The nice value is 0 by default, but can be changed through a system call to the nice value. Change the value range from -20 to +19. The nice value determines the static priority of the thread. In general, the nice value of a system administrator has a higher priority than that of a common thread. The nice value ranges from -20 to -1.

Let’s look in more detail at the two scheduling algorithms for Linux, which are internally similar to the runqueue design. The runqueue has a data structure that monitors all runnable tasks in the system and selects the next one to run. Each run queue is related to each CPU in the system.

The Linux O(1) scheduler has been a popular scheduler throughout history. The name comes from its ability to schedule tasks in constant time. In the O(1) scheduler, the schedule queue is organized into two arrays, an array for active tasks and an array for expired tasks. As shown in the figure below, each array contains 140 linked list heads, each with a different priority.

The general process is as follows:

The scheduler selects the highest-priority task from the active array. If the task’s time slice expires, move it to the expiration array. If the task is blocked, such as waiting for the I/O events, so before its time slice expired, once the I/O operation is complete, then this task will continue to run, it will be back before being in the array of activities, because before this task has consumed part of CPU time slice, so it will run the rest of the time slice. When the task has run its time slice, it is put into the expired array. Once there are no other tasks in the active array, the scheduler will swap Pointers, making the active array obsolete and the expired array active. Using this method ensures that each priority task can be executed without causing thread starvation.

In this scheduling mode, tasks with different priorities get different time slices allocated by THE CPU. Processes with higher priorities usually get longer time slices, while tasks with lower priorities get less time slices.

In this way, in order to ensure better service, interactive processes are usually given higher priority. Interactive processes are user processes.

Linux systems don’t know whether a task is I/O intensive or CPU intensive, it just depends on the interactive way that Linux systems distinguish between static and dynamic priorities. Dynamic priority is achieved using a reward system. Rewards work in two ways: rewarding interactive threads and punishing cpu-hogging threads. In the Linux O(1) scheduler, the highest priority reward is -5. Note that the lower this priority is, the more likely it is to be accepted by the thread scheduler, so the highest penalty priority is +5. The operating system maintains a variable called sleep_avg, which increases when a task wakes up, and decreases when the task is preempted or the amount of time expires, which is reflected in the reward mechanism.

The O(1) scheduler is the 2.6 kernel version of the scheduler, originally introduced in the unstable 2.5 version. Early scheduling algorithms in multiprocessor environments demonstrated that scheduling decisions can be made by accessing active arrays. So that the schedule can be completed at a fixed time O(1).

What does it mean that the O(1) scheduler uses a heuristic approach?

In computer science, a heuristic is a way to solve a problem quickly when traditional solutions are slow, or to find an approximate solution in a situation where traditional methods cannot find any exact solution.

O(1) uses heuristics in this way, which makes task priorities complex and imperfect, resulting in poor performance when dealing with interactive tasks.

To overcome this shortcoming, the O(1) Scheduler developers proposed a new solution, namely the Completely Fair Scheduler (CFS). The main idea of CFS is to use a red-black tree as a scheduling queue.

The data structure is too important.

CFS ranks tasks in the tree according to how long they have been running on the CPU, down to the nanosecond. The following is the construction model of CFS

The CFS scheduling process is as follows:

The CFS algorithm always prioritises the tasks that use the least CPU time. The smallest task is usually on the far left. When a new task needs to be run, CFS compares the task to the leftmost value, if the task has the minimum time value, it will run, otherwise it will compare and find the right place to insert. The CPU then runs the leftmost task in the current comparison on the red-black tree.

The time to select a node in the red-black tree to run can be constant, but the time to insert a task is O(loog(N)), where N is the number of tasks in the system. This is acceptable given the current load levels on the system.

The scheduler only needs to consider runnable tasks. These tasks are placed in the appropriate scheduling queues. Non-runnable tasks and tasks waiting for various I/O operations or kernel events are placed in a wait queue. The wait queue header contains a pointer to the task list and a spin lock. Spin-locks are useful in concurrent processing scenarios.

Synchronization in Linux

Let’s talk about synchronization in Linux. Early Linux kernels had only one Big Kernel Lock (BKL). It blocks the ability of different processors to process concurrently. Therefore, some more fine-grained locking mechanisms need to be introduced.

Linux provides several different types of synchronization variables that can be used in both the kernel and user applications. In the stratum, Linux provides encapsulation for hardware-supported atomic instructions by using operations such as atomic_set and atomic_read. The hardware provides memory reordering, which is the mechanism of the Linux barrier.

The description of a spin-lock is that when two processes access a resource at the same time, after one process obtains the resource, the other process doesn’t want to be blocked, so it spins and waits a while to access the resource. Linux also provides mechanisms such as mutex or semaphores, and also supports non-blocking calls such as mutex_tryLock and mutex_tryWait. Interrupt processing transactions are also supported by dynamically disabling and enabling corresponding interrupts.

Linux boot

Let’s talk about how Linux starts.

After the computer is powered On, the BIOS performs a power-on self-test (POST) to detect and initialize the hardware. Because the operating system starts using a disk, screen, keyboard, mouse and other devices. Next, the first partition on the disk, also known as the MBR(Master Boot Record), is read into a fixed memory area and executed. This partition has a very small program, only 512 bytes. The boot program copies itself to memory at higher addresses to free memory at lower addresses for the operating system.

After the replication is complete, the boot program reads the root directory of the boot device. The boot program must understand file system and directory formats. The boot program is then called into the kernel, handing over control to the kernel. Until here, boot has done its job. The system kernel starts to run.

The kernel startup code is done in assembly language, including creating the kernel stack, identifying THE CPU type, calculating memory, disabling interrupts, starting the memory management unit, etc., and then calling the MAIN function of C language to execute the operating system part.

This part also does a lot of things. First, it allocates a message buffer to hold any problems arising from debugging, and the debugging information is written to the buffer. If debugging goes wrong, this information can be called up by the diagnostics.

Then the operating system will automatically configure the device, detect the device, and load the configuration file. If the detected device responds, it will be added to the linked device table. If there is no corresponding device, it will be classified as unconnected and directly ignored.

After configuring all the hardware, the next thing to do is carefully work with process 0 manually, set up its stack, then run it, perform initialization, configure the clock, and mount the file system. Create init processes (process 1) and daemons (process 2).

The init process checks its flag to determine whether it is serving single or multiple users. In the former case, it calls the fork function to create a shell process and waits for it to finish. In the latter case, fork is called to create a process that runs a system-initialized shell script (i.e., /etc/rc), which can perform file system consistency checks, mount file systems, start daemons, and so on.

The /etc/rc process then reads data from /etc/ttys, which lists all terminals and properties. For each enabled terminal, the process calls the fork function to create a copy of itself, performs internal processing, and runs a program called Getty.

The Getty program will type on the terminal

login:
Copy the code

Wait for the user to enter a user name. After that, the Getty program ends and the log-in program /bin/login starts running. The login program needs to enter a password and compare it with the password stored in /etc/passwd. If the password is correct, the login program replaces itself with the user shell program and waits for the first command. If not, the login program asks for another user name.

The entire system startup process is as follows

Linux Memory Management

The Linux memory management model is straightforward because the mechanism of Linux makes it portable and enables Linux to be implemented on machines with similar memory management units. Let’s look at how Linux memory management is implemented.

The basic concept

Every Linux process has an address space, which consists of three segments: TEXT, data, and STACK. The following is an example of the process address space.

Data segments contain the storage of program variables, strings, arrays, and other data. The data segment is divided into two parts, the data that has been initialized and the data that has not been initialized. The uninitialized data is what we call BSS. The initialization of the data section requires a constant determined by compilation and a variable with an initial value when the program is started. All variables in the BSS section are initialized to 0 after loading.

Unlike Text segments, data segments can be changed. A program always modifies its variables. Moreover, many programs require dynamic allocation of space at execution time. Linux allows data segments to grow and shrink as memory is allocated and reclaimed. To allocate memory, a program can increase the size of a data segment. There is a set of standard libraries in the C language, malloc, that are often used to allocate memory. The process address space descriptor contains a dynamically allocated area of memory called the heap.

The third segment is the stack segment. On most machines, the stack segment will be at the top address of the virtual memory address and will extend down to address space zero. For example, on 32-bit x86 machines, the stack starts at 0xC0000000, which is the 3GB virtual address limit that processes can see in user mode. If the stack continues to grow beyond the stack segment, a hardware failure occurs and the page is lowered by one page.

When the program starts, the stack area is not empty; instead, it contains all of the shell environment variables and the command lines typed into the shell in order to invoke it. For example, when you type in

cp cxuan lx
Copy the code

The cp program will run on the stack with the string cp cxuan lx so that the names of the source and destination files can be found.

When two users are running in the same program, such as an editor, two copies of the editor program code are kept in memory, but this is not efficient. Linux supports shared text segments as an alternative. In the figure below we can see that processes A and B have the same text area.

The data and stack segments are shared only after fork, and the unmodified pages are also shared. If either one needs to be larger but there is no adjacent space to accommodate it, there is no problem because adjacent virtual pages do not have to map to adjacent physical pages.

In addition to dynamically allocating more memory, processes in Linux can access file data through memory-mapped files. This feature allows us to map a file to a part of process space and that file can be read and written like an array of bytes in memory. Mapping a file into it makes random reads and writes much easier than using I/O system calls like read and write. Access to shared libraries uses this mechanism. As shown below.

We can see that two identical files are mapped to the same physical address, but they belong to different address Spaces.

The advantage of a mapped file is that two or more processes can be mapped to the same file at the same time, and any write by one process to the file is visible to the other files. You can provide high bandwidth for multi-threaded shared memory by mapping temporary files that disappear when the process exits. But in reality, no two address Spaces are the same, because each process maintains different open files and signals.

Linux memory management system calls

Let’s look at the system call approach to memory management. In fact, POSIX does not specify any system calls for memory management. Linux, however, has its own memory system calls, the main system calls are as follows

The system calls describe
s = brk(addr) Change the data segment size
a = mmap(addr,len,prot,flags,fd,offset) mapping
s = unmap(addr,len) Cancel the mapping

If an error is encountered, s returns -1, a and ADDR are the memory address, len is the length, prot is the control guard bit, flags is the other flag bit, fd is the file descriptor, and offset is the file offset.

BRK specifies the size of a data segment by giving the address of the first byte beyond the segment. If the new value is larger than the original, the data area will become larger and smaller, and vice versa.

The Mmap and unmap system calls control the mapping file. The first parameter, addr, of MMP determines the address of the file map. It must be a multiple of the page size. If the argument is 0, the system assigns an address and returns a. The second parameter is the length, which tells you how many bytes to map. It is also a multiple of the page size. Prot determines the protection bits of the mapping file, which can be marked as readable, writable, executable, or a combination of these. The fourth argument, flags, controls whether the file is private or readable and whether addr is required or just prompted. The fifth argument, fd, is the file descriptor to map. Only open files can be mapped, so to map a file, you must open the file. The last parameter, offset, indicates when the file starts, not necessarily from zero every time.

Linux memory management implementation

The memory management system is one of the most important parts of the operating system. Since the early days of computers, we’ve been using more memory than we actually have in the system. Memory allocation strategies overcome this limitation, and the best known of these is virtual Memory. Virtual memory allows the system to have more memory by sharing it among multiple competing processes. The virtual memory subsystem consists of the following concepts.

Large address space

The operating system makes the system seem to use much more than the actual physical memory, because virtual memory is many times larger than physical memory.

To protect the

Each process in the system has its own virtual address space. These virtual address Spaces are completely separate from each other, so processes running one application do not affect the other. In addition, hardware virtual memory mechanisms allow memory to protect critical memory areas.

The memory mapping

Memory maps are used to map images and data files to the process address space. In memory mapping, the contents of a file are mapped directly into the virtual space of a process.

Fair physical memory allocation

The memory management subsystem allows each running process in the system to fairly allocate the system’s physical memory.

Shared virtual memory

Although virtual memory allows processes to have their own memory space, sometimes you need to share memory. For example, when several processes are running in the shell at the same time, which involves IPC interprocess communication, you need to share memory for information transfer instead of making copies of each process to run independently.

What is virtual memory

Abstract model of virtual memory

Before considering Linux’s approach to supporting virtual memory, it’s useful to consider an abstract model that doesn’t get bogged down in too much detail.

When the processor executes an instruction, it reads the instruction from memory and decodes it. When the instruction is decoded, it retrieves the contents of a certain location and stores it in memory. The processor then moves on to the next instruction. In this way, the processor is always accessing memory to get instructions and store data.

In a virtual memory system, all address Spaces are virtual rather than physical. But it’s the physical addresses that actually store and extract instructions, so you need to have the processor translate virtual addresses into physical addresses based on a table maintained by the operating system.

For simple translation, the virtual and physical addresses are divided into fixed-size blocks called pages. These pages have the same size, and if the page size is different, the operating system will have a hard time managing it. Linux on Alpha AXP systems uses 8 KB pages, while Linux on Intel x86 systems uses 4 KB pages. Each page has a unique number, the page frame number (PFN).

That’s the Linux memory mapping model, where the virtual address consists of two parts: the offset and the virtual page frame number. Each time the processor encounters a virtual address, it extracts the offset and the virtual page frame number. The processor must convert the virtual page frame number to the physical page number, and then access the physical page at the correct offset.

The figure above shows the virtual address space of two processes A and B, each with its own page table. These page tables map virtual pages in the process to physical pages in memory. Each item in the page table contains

  • Valid Flag: indicates whether this page table entry is valid
  • The number of the physical enclosure described by this entry
  • Access control information, how pages are used, whether they are writable and whether code can be executed

To map the virtual address of the processor to the physical address of memory, you first need to calculate the page frame number and offset of the virtual address. The page size is a power of 2 and can be done by shifting.

If the current process attempts to access the virtual address but cannot, this situation is called a page missing exception, and the virtual operating system is notified of the wrong address and the cause of the page error.

By mapping virtual addresses to physical addresses in this way, virtual memory can be mapped to the physical pages of the system in any order.

On-demand paging

Because physical memory is much smaller than virtual memory, operating systems need to be careful not to directly use inefficient physical memory. One way to save physical memory is to load only the pages currently being used by the executing program (isn’t that a lazy loading idea?). . For example, you can run the database to query the database, in which case not all the data is loaded into memory, only the data that needs to be checked. This technique of loading virtual pages in only when needed is called on-demand paging.

exchange

If a process needs to pass a virtual page into memory, but no physical page is available, the operating system must discard another page in physical memory to make room for that page.

If the page has been modified, the operating system must preserve the contents of the page so that it can be accessed later. This type of page is called a dirty page, and when it is removed from memory, it is kept in a special file called a swap file. Access to swap files is slow relative to the speed of the processor and physical memory, and the operating system needs to balance writing pages to disk and keeping them in memory for reuse.

Linux using a least recently used (LRU) page aging technology to fair choose the page may be removed from the system, this scheme involved in the system each page, the age of the page with the changes in the number of access, if a page visited many, then the page said the younger, if a er page access number is too little, The easier it is to swap out the page.

Physical and virtual addressing modes

Most multifunctional processors support the concept of physical and virtual address modes. The physical addressing mode does not require a page table, and the processor does not attempt to perform any address translation in this mode. The Linux kernel is linked to run in the physical address space.

Alpha AXP processors have no physical addressing mode. Instead, it divides the memory space into several regions and specifies two of them as physically mapped addresses. This kernel address space is called the KSEG address space and contains all addresses from 0xFFFFFC0000000000 up. In order for code linked from KSEG (kernel code, by definition) to execute or access data in it, the code must execute in kernel mode. Link to the Linux kernel on Alpha to execute from address 0xFFFFFC0000310000.

Access control

Each entry in the page table also contains access control information that primarily checks whether the process should access memory.

You need to restrict memory access when necessary. For example, memory that contains executable code is read-only; The operating system should not allow a process to write data through its executable code. In contrast, pages containing data can be written, but attempts to execute instructions in that memory will fail. Most processors have at least two modes of execution: kernel mode and user mode. You do not want access to user-executed kernel code or kernel data structures unless the processor is running in kernel mode.

The access control information is stored in the Page Table Entry above, which is the PTE of Alpha AXP. Bit fields have the following meanings

  • V

The bit is valid

  • FOR

Failure at read, failure occurred while attempting to read this page

  • FOW

Error while writing. Error occurred while trying to write

  • FOE

An error occurs during execution. Whenever an instruction in this page is attempted, the processor reports the page error and passes control to the operating system.

  • ASM

Address space match, which is used when the operating system wants to clear certain entries in the translation buffer.

  • GH

Hint used when mapping an entire block using a single transformation buffer entry instead of multiple transformation buffer entries.

  • KRE

Code running in kernel mode can read the page

  • URE

Code in user mode can read the page

  • KWE

Code that runs in kernel mode can be written to a page

  • UWE

Code that runs in user mode can be written to the page

  • Page frame no.

For a PTE with the V-bit set, this field contains the physical page frame number (page frame number) for the PTE. For an invalid PTE, if this field is not zero, it contains information about the page’s location in the interchange file.

In addition, Linux uses two bits

  • _PAGE_DIRTY

If so, the page needs to be written out to the swap file

  • _PAGE_ACCESSED

Linux is used to mark a page as accessed.

The cache

The above virtual memory abstraction model can be implemented, but not very efficiently. Both operating system and processor designers try to improve performance. But in addition to increasing the speed of the processor, memory, etc., the best approach is to maintain a cache of useful information and data to make some operations faster. In Linux, you use a lot of buffers related to memory management, using buffers to improve efficiency.

Buffer cache

The buffer cache contains the buffer of data used by the block device driver.

Remember what a block device is? Just to review this

A block device is a device that can store information in a fixed-size block. It supports reading and (optionally) writing data in a fixed-size block, sector, or cluster. Each block has its own physical address. Usually the block size is between 512 and 65536. All information transferred will be in contiguous blocks. The basic feature of a block device is that each block is relatively opposite and can be read and written independently. Common block devices include hard disks, Blu-ray discs, and USB disks

Block devices generally require fewer pins than character devices.

The buffer cache is used to quickly find blocks of data through the device identifier and block number. If the data can be found in the buffer cache, there is no need to read the data from the physical block device, which is much faster.

Page caching

The page cache is used to speed up access to images and data on disk

It is used to cache the contents of a file one page at a time and can be accessed through the file and the offset in the file. As pages are read from disk into memory, they are cached in the page cache.

Swap cache

Only the changes (dirty pages) are saved in the swap file

As long as these pages are not modified after being written to the swap file, the next time the page is exchanged, there is no need to write it to the swap file because the page is already in the swap file. You can just throw it away. On heavily switched systems, this saves a lot of unnecessary and expensive disk operations.

Hardware cache

A hardware cache is usually used in the processor. Caching of page table entries. In this case, the processor does not always read the page table directly, but instead caches the translations of the pages as needed. These are transform backup buffers, also known as TLBS, that contain cached copies of page table entries from one or more processes in the system.

After referencing the virtual address, the processor will try to find a matching TLB entry. If found, the virtual address can be translated directly to the physical address and the correct action can be performed on the data. If the processor cannot find a matching TLB entry, it gets support and help from the operating system by signaling to the operating system that a TLB loss has occurred. System-specific mechanisms are used to pass this exception to operating system code that can fix the problem. The operating system generates a new TLB entry for the address map. After the exception is cleared, the processor will try again to translate the virtual address. The execution succeeds this time.

The downside of using caches is that to save effort, Linux must spend more time and space maintaining these caches, and if the caches are corrupted, the system will crash.

Linux page table

Linux assumes that page tables are divided into three levels. Each page table accessed contains the next page table

PDG in the figure represents the global page table, and whenever a new process is created, a new page directory, PGD, is created for the new process.

To translate a virtual address to a physical address, the processor must take the contents of each level field, translate it to the offset of the physical page containing the page table, and read the page frame number of the next level page table. This is repeated three times until the box number of the physical page containing the virtual address is found.

Every platform on which Linux runs must provide translation macros that allow the kernel to traverse the page tables of a particular process. This way, the kernel does not need to know the format of the page table entries or how they are arranged.

Page assignment and unassignment

There are many requirements for physical pages in the system. For example, the operating system needs to allocate pages when images are loaded into memory.

All physical pages in the system are described by the MEM_MAP data structure, which is a list of MEM_MAP_T. It includes some important attributes

  • Count: This is the number of users on a page, greater than 1 when the page is shared between multiple processes
  • Age: This describes the age of the page and is used to determine whether the page is suitable for discarding or swapping
  • Map_nr: this is the physical enclosure number described by this mem_map_t.

The page allocation code finds and frees pages using the vector free_area, each element of which contains information about the page block.

Distribution of the page

Linux page assignment uses a well-known partner algorithm to assign and unassign pages. Pages are block-allocated as powers of two. This means that it can allocate 1 page, 2 pages, 4 pages, and so on, as long as there are enough pages available in the system to meet the requirements. The criterion is NR_free_pages > min_free_pages, and if it is, the page block of the desired size is searched in free_area to complete the allocation. Each element of free_area has a mapping of allocated pages and free page blocks for a block of that size.

The allocation algorithm searches for the requested size of the page block. If no block of the requested size is available, a block of twice the requested size is searched and repeated until a block of the free_area is found. If the found page block is larger than the requested page block, the found page block is subdivided until the appropriate size block is found.

Since each block is a power of two, the splitting process is easy because you only have to split the block in half. The free blocks are queued in the appropriate queue and the allocated page blocks are returned to the caller.

If a 2-page block is requested, the first block of 4 pages (starting with the frame on page 4) is split into two 2-page blocks. The first page (starting from the frame on page 4) is returned to the caller as the allocated page, and the second block (starting from the page on page 6) is queued as a free block on page 2 on element 1 of the Free_area array.

Page unallocated

One of the most important consequences of this approach to memory is fragmentation, which divides large free pages into smaller ones. The page unallocation code reassembles pages into larger free blocks as much as possible. Each time a page is released, adjacent blocks of the same size are checked to see if they are free. If so, it is combined with the newly freed page block to form a new free page block for the next page size block. Every time two page blocks are recombined into a larger free page block, the page release code attempts to recombine that page block into a larger free page. In this way, the blocks of available pages will use as much memory as possible.

For example, in the figure above, if you want to free the page on page 1, you combine it with the page frame on page 0 that is already free and queue it as a free block of 2 pages in element 1 of free_area

The memory mapping

The kernel has two types of memory mapping: shared and private. Private is used when the process is read-only and does not write to the file, where private mappings are more efficient. However, any write to a private mapped page causes the kernel to stop mapping pages in that file. Therefore, the write does not change the file on disk and is not visible to other processes accessing the file.

On-demand paging

Once the executable image has been mapped from memory to virtual memory, it can be executed. Because only the beginning of the image is physically pulled into memory, it will quickly access areas of virtual memory where physical memory does not yet exist. The operating system reports this error when a process accesses a virtual address that does not have a valid page table.

Page error describes the virtual address of the page error and the type of memory access (RAM) caused.

Linux must find a VM_AREa_struct structure that represents the area of memory where the page error occurred. Because searching for VM_AREA_struct data structures is critical to efficiently handling page errors, they are linked together in an AVL (Adelson-Velskii and Landis) tree structure. If the virtual address that caused the failure does not have a VM_AREA_struct structure, then the process has accessed an invalid address. Linux sends a SIGSEGV signal to the process, and if the process does not have a handler for the signal, the process will terminate.

Linux then checks for the type of page error that occurred against the type of access allowed in this virtual memory region. It also signals a memory access error if the process accesses memory in an illegal way, such as writing to a read-only region.

Linux has now determined that the page error is legal, so it must be handled.

The file system

In Linux, the most intuitive and visible part is the file system. Let’s take a look at the file system, system calls, and the principles and ideas behind the filesystem implementation in Linux China. Some of these ideas originated with MULTICS and are now used by other operating systems such as Windows. Linux is designed around the idea that Small is Beautiful. Linux provides a powerful and elegant file system, even though it uses only the simplest mechanisms and a few system calls.

Basic concepts of the Linux file system

Linux was originally designed as the MINIX1 file system, which only supported 14-byte file names and its maximum file size was up to 64 MB. The file system after MINIX 1 is the ext file system. The Ext system was much better than MINIX 1 in terms of both the size of bytes and the size of files, but ext was still not as fast as MINIX 1, so Ext 2 was developed to support long file names and large files, with better performance than MINIX 1. This makes it the primary file system for Linux. It’s just that Linux uses VFS to support multiple file systems. When Linux links, users can dynamically mount different file systems onto the VFS.

A file in Linux is a sequence of bytes of any length. A file in Linux can contain any information, such as ASCII codes, binaries, and other types of files are indistinct.

For convenience, files can be organized in a directory, and directories stored as files can be largely handled as files. Directories can have subdirectories to form a hierarchical file system. The root directory under Linux is /, which usually contains multiple subdirectories. The/character is also used to distinguish directory names, such as /usr/cxuan, which refers to the usr directory under the root directory with a subdirectory called cxuan.

Let’s look at the directory names under the root directory of the Linux system

  • /binIt is the important binary application that contains the binary files and the commands used by all users of the system
  • /bootTo start the file that contains the boot loader
  • /dev, including device files, terminal files, USB or any device connected to the system
  • /etc, configuration files, startup scripts, etc., which includes all the configuration files required by the program, as well as startup and shutdown shell scripts to start/stop individual applications
  • /homeAll users use the home directory to store their personal information
  • /lib, system library files, including support for binary libraries located in /bin and /sbin
  • /lost+foundIn the root directory to provide a lost + search system, you must be in the root user to view the contents of the current directory
  • /mediaTo mount removable media
  • /mntTo mount the file system
  • /optProvides an optional application installation directory
  • /proc, a special dynamic directory for maintaining system information and status, including information about currently running processes
  • /root, the main directory folder of the root user
  • /sbin, important binary system files
  • /tmpWhen the system restarts, the files in this directory will be deleted
  • /usrContains applications and files that are accessible to most users
  • /varFrequently changing files such as log files or databases

In Linux, there are two kinds of paths. One is the absolute path. The absolute path tells you to look for files from the root directory. A relative path is also called a working directory.

If /usr/local/books is the working directory, then the shell command

cp books books-replica 
Copy the code

This is the relative path, and this is the relative path

cp /usr/local/books/books /usr/local/books/books-replica
Copy the code

Is the absolute path.

It is common in Linux for one user to use another user’s file or to use files in the file tree structure. Two users share the same file. The file is located in the directory structure of one user. When the other user needs to use the file, the other user must use the absolute path to refer to it. If the absolute path is long, it becomes cumbersome to type each time, so Linux provides a link mechanism.

For example, here’s a diagram before you use links

As shown above, for example, there are two working accounts Jianshe and Cxuan, and Jianshe wants to use the A directory under cxuan’s account, then it might enter /usr/cxuan/A, which is A kind of diagram without using the link.

The following diagram shows the use of the link

Jianshe can now create a link to use the directory under CXuan. ‘

When a directory is created, two directory entries are created at the same time. And.. , the former represents the working directory itself, the latter represents the parent directory of the directory, that is, the directory where the directory is located. In this way, the directory in cXuan is accessed from /usr/jianshe. /cxuan/xxx

The Linux file system is disk-neutral. What does that mean? In general, file systems on one disk are kept separate from each other. If a file system directory wants to access a file system on another disk, you can do this in Windows.

The two file systems are on separate disks and remain independent of each other.

In Linux, where there is mounting support, which allows one disk to be mounted on another disk, the relationship would look like this

Once mounted, the two file systems no longer need to care which disk the file systems are on; the two file systems are visible to each other.

Another feature of the Linux file system is locking. In some applications, two or more processes may use the same file at the same time, which may result in race conditions. One solution is to lock it with different granularity, just to prevent a process from modifying only one row and making the entire file unusable.

POSIX provides a flexible locking mechanism with different levels of granularity, allowing a process to lock a byte or an entire file with a single indivisible operation. The locking mechanism requires the process trying to lock to specify the file it wants to lock, the starting location, and the number of bytes it wants to lock

Linux provides two types of locks: shared locks and mutex locks. If a part of the file already has a shared lock, then adding an exclusive lock will not succeed. If a part of the file system is already a mutex, then any locking before the mutex is removed will not succeed. For the lock to succeed, all bytes of the part that requested the lock must be available.

In the locking phase, the process needs to design the situation after the failure of locking, that is, to determine whether to choose blocking after the failure of locking, if the choice of blocking, then when the lock in the locked process is removed, the process will unblock and replace the lock. If the process chooses non-blocking, the lock is not replaced, and is immediately returned from the system call with a status code indicating success, and the process chooses the next time to try again.

Lock areas can be overlapped. Next we demonstrate locking areas for three different conditions.

As shown in the figure above, A’s shared lock is locked from byte 4 to byte 8

As shown in the figure above, the process has shared locks on both A and B, where 6-8 bytes are overlapping locks

As shown in the figure above, processes A, B, and C have shared locks, so bytes 6 and 7 are shared locks.

If A process attempts to lock the area at the sixth byte, the setting fails and blocks. Since the area is locked by A, B and C, the process cannot lock the area until A, B and C release the lock.

Linux file system calls

Many system calls are related to files and file systems. Let’s first look at system calls to individual files, then to entire directories and files.

To create a new file, the creat method is used, noting that there is no E.

As a side note, Ken Thompson, the founder of UNIX, was once asked what he would do if he had the chance to rewrite UNIX. He replied that he would change creat to CREATE.

The two arguments to this system call are the file name and the protection mode

fd = creat("aaa",mode);
Copy the code

This creates a file named aaa and sets the protection bit for the file according to mode. These bits determine which users are likely to access the file and how.

The creat system call not only creates a file named AAA, it also opens the file. To allow subsequent system calls to access the file, the CREat system call returns a non-negative integer, which is called the file descriptor, which is the fd above.

If a CREat system call is called on an existing file, the contents of that file are cleared, starting with 0. The open system call can also create files by setting the appropriate parameters.

Let’s take a look at the main system calls, as shown in the following table

The system calls describe
fd = creat(name,mode) A way of creating a new file
fd = open(file, …) Open the file to read, write, or read
s = close(fd) Close an open file
n = read(fd, buffer, nbytes) Read data from a file into the cache
n = write(fd, buffer, nbytes) Write data from the cache to a file
position = lseek(fd, offset, whence) Move the file pointer
s = stat(name, &buf) Obtaining file Information
s = fstat(fd, &buf) Obtaining file Information
s = pipe(&fd[0]) Create a pipe
s = fcntl(fd,…) Other operations, such as locking files

In order to read and write a file, you need to open the file first. You must use creat or open to open the file. The parameter is how to open the file, whether it is read-only, read-write or write-only. The open system call also returns the file descriptor. Once the file has been opened, it needs to be closed using the close system call. Close and open always return the minimum number of fd’s unused.

What is a file descriptor? A file descriptor is a number that identifies open files in the computer’s operating system. It describes data resources and how to access them.

When a program asks to open a file, the kernel does the following

  • Granting access
  • inGlobal File TableCreate aEntry (entry)
  • Provide the software with the location of the item

File descriptors consist of unique non-negative integers, and at least one file descriptor exists for every open file on the system. File descriptors were first used in Unix and are used by modern operating systems including Linux, macOS, and BSD.

When a process successfully accesses an open file, the kernel returns a file descriptor that points to an entry in the global file table. This file entry contains the file’s inode information, byte shifts, access restrictions, and so on. For example, see the following figure

By default, the first three file descriptors are STDIN(standard input), STDOUT(standard output) and STDERR(standard error).

The file descriptor for standard input is 0. In terminals, the default is the user’s keyboard input

The file descriptor for the standard output is 1, which in terminals defaults to the user’s screen

The default data flow associated with errors is 2, and in terminals, the default is the user’s screen.

After a brief talk about file descriptors, let’s return to the file system call discussion.

The most expensive file system calls are read and write. Both read and write take three arguments

  • File descriptor: Tells which open file to read and write to
  • Buffer address: Tells where data needs to be read from and written to
  • statistical: tells how many bytes to transfer

That’s all the parameters. It’s a very simple and lightweight design.

While almost all programs read and write files sequentially, some programs need to be able to access any part of a file at random. Associated with each file is a pointer that indicates the current position in the file. When read (or written) sequentially, it usually points to the next byte to be read (or written). If the pointer was at position 4096 before reading 1024 bytes, it will automatically move to position 5120 after a successful reading of the system call.

The Lseek system call changes the value of the pointer position so that subsequent calls to read or write can start anywhere in the file, even beyond the end of the file.

Lseek = lseek

The reason lseek avoids being called SEEK is that SEEK was already used in the search function on previous 16-bit computers.

Lseek takes three arguments: the first is the file descriptor for the file, and the second is the location of the file. The third tells whether the file location is relative to the beginning of the file, the current location, or the end of the file

lseek(int fildes, off_t offset, int whence);
Copy the code

The return value of lseek is the absolute position in the file after changing the file pointer. Lseek is the only system call that never causes a real disk lookup, it just updates the current file location, which is a number in memory.

For each file, Linux keeps track of the file schema (regular, directory, special files), size, last modified time, and other information. Programs can see this information through the STAT system call. The first parameter is the file name, and the second is a pointer to the structure to place the request information. The properties of these structures are shown in the following figure.

A device for storing files
A device for storing files
I – node number
File mode (including protection bit information)
The number of file links
File owner identification
Group to which the file belongs
File size (bytes)
Creation time
Last modification/access time

The fstat call is the same as stat, except that fstat can operate on open files, whereas stat can only operate on paths.

The PIPE file system call is used to create shell pipes. It creates a series of pseudo files that buffer the data between the pipe components and returns file descriptors that read or write to the buffer. In the pipeline, do something like this

Sort the < | in the head - 40Copy the code

The SORT process will print to the file descriptor 1, the standard output, to the pipe, and the head process will read from the pipe. In this way, sort just reads from file descriptor 0 and writes to file descriptor 1 (pipe) without even knowing that they have been redirected. If there is no redirection, sort automatically reads from the keyboard and outputs to the screen.

The final system call is FCNTL, which is used to lock and unlock files, apply shared locks and mutex, or perform some other file-related operation.

Instead of focusing on individual files, let’s focus on system calls related to the overall directory and file system. These system calls are listed below and we’ll take a look at them.

The system calls describe
s = mkdir(path,mode) Create a new directory
s = rmdir(path) Remove a directory
s = link(oldpath,newpath) Create a link to an existing file
s = unlink(path) Unlink the file
s = chdir(path) Changing the working directory
dir = opendir(path) Open a directory to read
s = closedir(dir) Close a directory
dirent = readdir(dir) Reads a directory entry
rewinddir(dir) Turn the directory so that it is used here

You can use mkdir and rmdir to create and delete directories. Note that the directory can be deleted only when it is empty.

When you create a link to an existing file, a directory entry is created. The system calls link to create the link, oldPath to the existing path, newpath to the path to link, and unlink to delete the directory entry. When the last link to the file is deleted, the file is automatically deleted.

You can change the working directory using the chdir system call.

The last four system calls are for reading directories. Like regular files, they can be opened, closed, and read. Each call to readdir returns a directory entry in a fixed format. You cannot write to a directory, but you can use creat or link to create a directory in a folder, or unlink to delete a directory. The user cannot look for a specific file in a directory, but can use rewindir on an open directory, allowing him to read from scratch.

Implementation of the Linux file system

Let’s focus on Virtual File systems. VFS hides from high-level processes and applications the distinction between all the file systems supported by Linux, and whether the file systems are stored on local devices or require access to remote devices over a network. Devices and other special files are associated with the VFS layer. Next, we’ll look at Linux’s first widely distributed file system: ext2. Next, we’ll look at the improvements made to the ext4 filesystem. Various other file systems are also in use. All Linux systems can handle multiple disk partitions, with different file systems on each partition.

Linux virtual file system

To enable applications to interact with different types of file systems on local or remote devices, there are hundreds of file systems in Linux, including EXT3 and EXT4, memorybased RAMFS, TMPFS, network-based NFS, and user-mode FUSE. Fuse, of course, should not be a complete file system, but rather a module that puts the file system implementation into user mode, satisfying the kernel file system interface, which is an implementation of the file system. For these file systems, Linux has a layer of abstraction called the VFS virtual file system,

The following table summarizes the four main file system structures supported by the VFS.

object describe
superblock Specific file system
Dentry Directory entry, an integral part of a path
I-node Specific file
File An open file associated with a process

Superblocks contain important information about the layout of the file system, and the destruction of a superblock can make the entire file system unreadable.

I-node Indicates the index node, which contains the descriptors of each file.

In Linux, directories and devices are also represented as files because they have corresponding i-Nodes

The file system where the superblock and index block reside has corresponding structure on disk.

To facilitate certain directory operations and path traversal, such as /usr/local/cxuan, VFS supports a dentry data structure that represents directory entries. There are a lot of things the dentry data structure (books. Gigatux. Nl/mirror/kern…

Directory entries are cached in the dentry_cache cache. For example, cache entries cache /usr, /usr/local, and so on. If multiple processes hardlink to the same file, their file objects will point to the same entry in this cache.

Finally, the file data structure, which represents both the open file and the memory representation, is created according to the open system call. It supports read, write, sendfile, Lock, and the other system calls we described earlier.

The actual file system implemented under the VFS does not need to use exactly the same abstractions and operations internally. However, they must semantically implement the same file system operations as those specified by the VFS object. The elements of the operational data structure for each of the four VFS objects are Pointers to functionality in the underlying file system.

Linux Ext2 file system

Now let’s take a look at one of the most popular disk file systems in Linux, ext2. The first version of Linux was for the MINIX1 file system, which was limited to a file name size of up to 64 MB. The MINIX 1 file system was permanently replaced by its extension system, Ext, because ext allowed longer filenames and file sizes. Due to ext’s poor performance, ext was replaced by its replacement, ext2, which is still widely used.

An ext2 Linux disk partition contains a file system whose layout is shown below

The Boot block, block 0, is not used by Linux, but is used to load and Boot the computer Boot code. After block 0, the disk partition is divided into groups that are independent of where the disk cylinder boundary is located.

The first block is a superblock. It contains information about the file system layout, including i-nodes, the number of disk blocks, and the start of the list of free disk blocks. The next is a group descriptor, which contains information about the location of the bitmap, the number of free blocks and i-Nodes in the group, and the number of directories in the group. This information is important because ext2 distributes directories evenly across the disk.

The two bitmaps in the diagram are used to record free blocks and free I-nodes, a selection inherited from the MINIX 1 file system, most UNIX file systems use bitmaps instead of free lists. The size of each bitmap is one block. If the size of a block is 1 KB, the number of block groups is limited to 8192 blocks and 8192 I-Nodes. The block size is a strict limit, the number of block groups is not fixed, and in a 4KB block, the number of block groups increases fourfold.

After the superblock, the i-nodes themselves are distributed. The value of i-nodes ranges from 1 to some maximum. Each I-node is a long of 128 bytes. These bytes can describe exactly one file. The I-Node contains statistics (including the owner information available from the stat system call, which is actually read from the I-node) and enough information to find all the disk blocks that hold the file data.

Data blocks follow an I-node. All the files and directories are kept here. If a file or directory contains more than one block, the distribution of the blocks across the disk may or may not be contiguous. In fact, large file blocks can be broken up into many smaller pieces scattered throughout the disk.

I-nodes corresponding to directories are scattered across the disk group. If there is enough space, ext2 organizes ordinary files into the same block group as the parent directory, and data files on the same block into the initial I-Node node. Bitmaps are used to quickly determine where to assign new file system data. When a new file block is allocated, ext2 also preallocates many additional data blocks to the file, which reduces the fragmentation of future writes to the file. This strategy achieves file system load on the entire disk, followed by file fragmentation and defragmentation, and better performance.

To do this, you first use a Linux system call, such as open, which determines the path to the open file. There are two kinds of paths, relative paths and absolute paths. If relative paths are used, the search starts from the current directory; otherwise, the search starts from the root directory.

The file name of a directory file cannot exceed 255 characters. The allocation of the file name is shown in the following figure

Each directory consists of an integer number of disk blocks so that the directory can be written to the disk as a whole. In a directory, the entries for files and subdirectories are unsorted and are next to each other. Directory entries cannot span disk blocks, so there are usually some unused bytes at the end of each disk block.

Each directory entry in the figure above consists of four fixed-length properties and one variable length property. The first attribute is the number of i-Nodes. The i-node number for file first is 19, file second is 42, and directory third is 88. This is followed by the rec_len field, which indicates how many bytes the entry size is, with some extensions after the name, and is used to find the next entry when the name is filled in with an unknown length until the last one is unused. That’s what the arrow means. This is followed by the type field: F for file, D for directory, and finally a fixed-length file name, which is 5, 6, 5, and ends with the file name.

How does the rec_len field extend? See the figure below

As you can see, the middle second is removed, so the field it is in becomes the fill of the first directory entry. Of course, this population can be used as a subsequent directory entry.

Because directories are searched in a linear order, it can take a long time to find directory entries at the end of large files. Therefore, the system maintains a cache of recently accessed directories. The cache uses the file name to look it up, and if the cache hits, the expensive thread search is avoided. Each part of the path holds a dentry object in the directory cache, and subsequent path elements are found through the I-Node until the actual file I-Node is found.

For example, if you want to use an absolute path to find a file, you need to go through the following steps: /usr/local/file

  • First, the system determines the root directory, which typically uses i-node 2, or inode 2, because inode 1 is on the ext2/3 file systemBad blockIndex node. The system places an entry in the dentry cache for future searches of the root directory.
  • Then, look for the string in the root directoryusrTo obtain the i-node number of /usr directory. /usr i-nodes also enter the dentry cache. The node is then taken out and the disk blocks are parsed out of it so that the /usr directory can be read and strings can be looked uplocal. Once this directory entry is found, the directory/usr/localThe i-Node node can be obtained. With the i-Node node number in /usr/local, you can read the I-node and determine the disk block where the directory resides. Finally, look for the file from /usr/local and determine its i-Node number.

If the file exists, the system will extract the I-Node node number and use it as an index to locate the corresponding I-Node in the I-Node node table and load it into memory. I-nodes are stored in the I-Node table. The i-Node table is a kernel data structure that contains i-Node node ids of currently opened files and directories. Here are some i-Node data structures supported by the Linux file system.

attribute byte describe
Mode 2 File properties, guard bits, setuID and setGID bits
Nlinks 2 Indicates the number of directory entries on an I-Node
Uid 2 The UID of the file owner
Gid 2 GID of the file owner
Size 4 File size
Addr 60 Addresses for 12 disk blocks followed by 3 indirect blocks
Gen 1 The code name is added each time the I-Node is reused
Atime 4 The last time the file was accessed
Mtime 4 The last time the file was modified
Ctime 4 The time of the i-Node was recently changed

Now let’s talk about the file reading process. Remember how the read function is called?

n = read(fd,buffer,nbytes);
Copy the code

When the kernel takes over, it starts with these three parameters and user-related information about the internal table. One of the entries in the inner table is an array of file descriptors. The file descriptor array uses file descriptors as indexes and holds one entry for each open file.

The file is related to the i-Node node number. How do you find the i-Node that corresponds to a file using a file descriptor?

One design idea used here is to insert a new table between the file descriptor table and the I-Node node table, called an open-file-Description table. The read/write location of the file exists in the open file descriptor table, as shown in the following figure

We use shell, P1, and P2 to describe the relationship between parent, child, and child processes. The Shell first generates P1, and P1’s data structure is a copy of the Shell, so both point to the same open file descriptor entry. When P1 is finished, the Shell’s file descriptor still points to the open file description at the P1 file location. Then the Shell generates P2, and the new child process automatically inherits the read/write location of the file. Neither P2 nor the Shell knows the exact read/write location of the file.

The two related processes described above are the parent and child processes. If an unrelated process opens a file, it will get its own open file descriptor entry and its own file read/write location, which is what we need.

Thus, opening a file descriptor is equivalent to giving related processes the same read/write location and unrelated processes their own private location.

An I-Node contains the disk addresses of three indirect blocks. Each of the addresses pointing to the disk blocks can store different sizes.

Linux Ext4 file system

To prevent data loss due to system crashes and power failures, the ext2 system must write each data block to disk immediately after it is created, and the latency caused by disk head seek operations is intolerable. Linux relies on journaling file systems for robustness, and ext3 is a journaling file system that is an improvement on ext2, and ext4 is an improvement on ext3, which is also a journaling file system. Ext4 changed ext3’s block-addressing scheme to support larger files and larger file system sizes. Let’s describe the features of the ext4 file system.

The most basic function of a file system with records is to log, which describes the operations of all the file systems in sequence. Changes to file system data or metadata are written out sequentially without the overhead of disk head movement during disk access. Eventually, the change is written and submitted to the appropriate disk location. If the file system breaks down before the change is committed to disk, during the restart, the system detects that the file system was not properly unmounted, and the log is traversed and the log’s record is applied to make the changes to the file system.

The Ext4 file system is designed to be highly compatible with the ext2 and ext3 file systems, despite changes to the Ext4 file system in both the kernel data structure and disk layout. However, a file system can be unmounted from an ext2 file system and successfully mounted to an ext4 file system with proper journaling.

Logs are files that are managed as circular buffers. Logs can be stored on the same or different devices as the primary file system. Journaling Block Device (JBD) is used to perform read and write operations.

There are three main data structures in the JBD: log Records, atomic operations, and transactions. A log entry describes a low-level file system operation that typically results in changes within a block. This is because a system call like write can involve changes in many places – the i-node node, the existing file block, the new file block, the free list, etc. Related log records are grouped atomically. Ext4 notifying the start and end of the system call process enables the JBD to ensure that all or none of the records for atomic operations are applied. Finally, the JBD treats a set of atomic operations as a transaction, mainly for efficiency reasons. Log records within a transaction are stored contiguously. Log records can only be discarded after all the changes have been applied to disk together.

Because of the overhead of journaling each disk, ext4 can be configured to keep journaling all disk changes, or only journaling changes related to file system metadata. Recording metadata alone can reduce overhead and improve performance, but it does not guarantee that file data will not be corrupted. Several other logging systems maintain a series of logs of metadata operations, such as SGI’s XFS.

/proc file system

Another Linux file system is the /proc (process) file system

Its main ideas came from the 8th version of UNIX developed by Bell LABS and later adopted by BSD and System V.

Linux, however, expands on this idea in a few ways. The basic concept is to create a directory in /proc for each process on the system. The name of the directory is the process PID, expressed in decimal notation. For example, /proc/1024 is a directory with a process id of 1024. In this directory are files related to process information, such as the process’s command line, environment variables, and signal masks. In fact, these files do not exist on disk. When this information is needed, the system reads it from the process as needed and returns it to the user in a standard format.

Many Linux extensions relate to other files and directories in /proc. They contain all kinds of information about cpus, disk partitions, devices, interrupt vectors, kernel counters, file systems, loaded modules, and so on. Non-privileged users can read a lot of this information, so they can learn about the system in a secure way.

NFS Network file system

The web has played an important role in Linux since its inception. We’ll look at THE Network File System (NFS), which is used in the modern Linux operating System to link different File systems on different computers into a logical whole.

The NFS architecture

The basic idea behind NFS is to allow arbitrarily selected clients and servers to share a common file system. In many cases, all clients and servers are shared on the same Local Area Network (LAN), but this is not necessary. It could also be the case that clients and servers can run on a WAN if they are far apart. A client can be a server, and a server can be a client, but for the sake of simplicity, we’re talking about clients as consuming services and servers as providing services.

Each NFS service exports one or more directories for remote clients to access. When a directory is available, all of its subdirectories are also available. As a result, the entire directory tree is usually exported as a whole. The list of directories exported by the server is maintained in a file, /etc/exports, which is automatically exported when the server is started. Clients access these exported directories by mounting them. When a client mounts a remote directory, the directory becomes part of the client directory hierarchy, as shown in the following figure.

In this example, client number one was mounted to the server’s bin directory, so it can now use the shell to access /bin/cat or any other directory. Similarly, the client 1 can also be mounted to the no. 2 on the server to access the/usr/local/projects/proj1 or other directories. Client number two can also be mounted to server number two, and the access path is/MNT /projects/proj2.

As you can see above, different clients mount files to different locations in their directory trees, so different clients have different access paths and different names for the same file. Mount points are usually local to the client, and the server is not aware of any of the mount points.

NFS protocol

Since one of NFS’s protocols supports heterogeneous systems, the client and server may be running different operating systems on different hardware, it is necessary to define the interface between the server and client. This will allow any new client to work with the existing server and vice versa.

NFS achieves this by defining two client-server protocols. A protocol is a series of requests sent by a client to a server and a corresponding reply sent by the server back to the client.

The first NFS protocol handles mounts. The client can send a pathname to the server and ask if the server can mount the server’s directory to its own directory level. Because the server does not care where it is mounted, the request does not include the mount address. If the pathname is valid and the specified directory has been exported, the server returns the file handle to the client.

File handles contain fields that uniquely identify the file system type, disk, I node number of the directory, and security information.

Subsequent calls to read and write files in the installed directory or any of its subdirectories use file handles.

When Linux starts, the shell script /etc/rc is run before multiple users. You can write commands to mount the remote file system into this script so that the necessary remote file systems can be automatically mounted before allowing the user to log in. Most Linux versions support automount. This feature supports associating remote directories with local directories.

Automatic mounting has the following advantages over manual mounting to the /etc/rc directory

  • If there is some kind of failure in the listed /etc/rc directory, the client will not be able to start, or the startup will be difficult, delayed, or accompanied by error messages, which will be wasted if the client does not need the server at all.
  • By allowing clients to try a set of servers in parallel, a degree of fault tolerance can be achieved and performance can be improved.

On the other hand, we default that all the optional file systems are the same when we mount automount. Since NFS does not provide support for file or directory replication, users themselves need to ensure that all of these file systems are the same. As a result, most autolomount applies only to binary files and read-only file systems with few changes.

The second NFS protocol is designed for access to files and directories. The client can manipulate directories and read and write files by sending messages to the server. The client can also access file properties, such as file mode, size, and last modified time. NFS supports most Linux system calls, but the open and close system calls are not.

Not supporting open and close is not an oversight, but rather a deliberate design, and there is absolutely no need to open a file before reading it, and no need to close it after reading it.

NFS uses the standard UNIX protection mechanism, using RWX bits to identify owners, groups, and other users. Initially, each request message carries the caller’s groupId and userId, which IS validated by NFS. In fact, it trusts the client not to cheat. You can use a public key cipher to create a security key that is used to authenticate the client and server on each request and reply.

NFS implementations

Even though the client and server code implementation is independent of the NFS protocol, most Linux systems use a three-tier implementation like the one below. The top layer is the system call layer, which handles system calls such as open, read, and close. The second layer, the virtual File system (VFS) layer, is invoked after parsing and parameter checking.

The task of the VFS layer is to maintain a table in which each opened file has an entry. The VFS layer maintains a virtual I node, or V-node, for each open file. The v node is used to indicate whether the file is local or remote. If the files are remote, the V-Node provides enough information for the client to access them. For local files, the file system and the i-node of the file are recorded, because modern operating systems can support multiple file systems. Although VFS was designed to support NFS, VFS is used by modern operating systems with or without NFS.

Linux IO

Now that we’ve looked at processes and threads in Linux and Linux memory management, let’s look at I/O management in Linux.

On Linux systems, as on other UNIX systems, IO management is straightforward and concise. All IO devices are treated as files and are read and written internally using the same read and write methods.

Linux IO basic concepts

Linux also has DISK, printer, network, and other I/O devices, which Linux integrates into the file system as a special file, usually in the /dev directory. These special files can be treated in the same way as regular files.

Special files generally fall into two types:

A block-specific file is a device that can store information in a fixed-size block. It supports reading and (optionally) writing data in a fixed-size block, sector, or cluster. Each block has its own physical address. Usually the block size is between 512 and 65536. All information transferred will be in contiguous blocks. The basic feature of a block device is that each block is relatively opposite and can be read and written independently. Common block devices include hard disks, Blu-ray discs, and USB drives. Block devices usually require fewer pins than character devices.

Disadvantages of block-specific files A block device based on a given solid state memory is slower than byte addressing based on the same type of memory because a read or write must begin at the beginning of the block. So, to read any part of the block, you have to find the beginning of the block, read the whole block, and discard it if you don’t use it. To write a portion of a block, you must find the beginning of the block, read the entire block into memory, modify the data, find the beginning of the block again, and then write the entire block back to the device.

Another class of I/O devices are character-special files. A character device sends or receives a stream of characters in character units, regardless of any block structure. Character devices are not addressable and do not have any seek operations. Common character devices are printers, network devices, mice, and most devices that are different from disks.

Each device – specific file is associated with the device driver. Each driver is identified by a primary device number. If a driver supports multiple devices, a new secondary device number will be added after the primary device. The primary and secondary device numbers together determine the unique driver.

As we know, in a computer system, the CPU does not directly deal with the Device, they have a component called Device Control Unit, such as hard disk disk controller, USB has USB controller, display has video controller, and so on. These controllers are like agents, they know how to respond to the behavior of the hard drive, mouse, keyboard, monitor.

Most character-specific files are not randomly accessible because they need to be controlled in a different way than block-specific files. For example, if you are typing some characters on the keyboard, but you find that you have made a mistake, some people like to use backspace to delete, others like to use del to delete. To interrupt a running device, some systems use Ctrl-U to terminate, but nowadays ctrl-C is generally used.

network

Another concept of I/O is networking, also introduced by UNIX. One of the key concepts in networking is sockets. Sockets allow users to connect to the network, just as mailboxes allow users to connect to the postal system. A schematic diagram of sockets is shown below

The positions of sockets are shown in the figure above, and sockets can be created and destroyed dynamically. After a socket is successfully created, the system returns a file descriptor, which is used to create links, read data, write data, and disconnect data. Each socket supports a specific type of network type, specified at creation time. The most commonly used ones

  • Reliable connection-oriented byte streams
  • Reliable connection-oriented packets
  • Unreliable packet transmission

A reliable connection-oriented byte stream uses pipes to establish a connection between two machines. Being able to guarantee that bytes arrive from one machine to another in sequence, the system can guarantee that all bytes arrive.

The second type is similar to the first except for the boundary between packets. If three writes are sent, the first receiver receives all the bytes directly; In the second method, the recipient receives all the bytes three times. In addition, users can also use a third, unreliable packet transmission. The advantage of this mode of transmission is high performance, which is sometimes more important than reliability. For example, in streaming media, performance is especially important.

There are two forms of transport protocol, TCP and UDP. TCP is a transport control protocol that transmits a reliable stream of bytes. UDP is a user datagram protocol that can transmit only unreliable byte streams. They all belong to the PROTOCOLS in the TCP/IP protocol cluster, and the following is the network protocol layer

As you can see, BOTH TCP and UDP are located at the network layer, so they both use IP protocol, the Internet protocol, as the basis.

Once the socket is established on the source and destination computers, a link can be established between the two computers. When a communicating party uses the LISTEN system call on a local socket, it creates a buffer and blocks until the data arrives. The other party uses the CONNECT system call, and if the other party accepts the CONNECT system call, the connection is established between the two sockets.

When a socket connection is established, it acts like a pipe from which a process can read and write data using the local socket’s file descriptor. The close system call is used to close the connection when it is no longer needed.

Linux I/O system calls

Every I/O device on a Linux system has a special file associated with it. What is a special file?

In an operating system, a special file is a file that is associated with a hardware device in the file system. Special files are also called device files. The purpose of a special file is to expose the device as a file in the file system. Special files provide an excuse for hardware devices that can be accessed by tools for file I/O. Because there are two types of devices, there are also two types of special files, namely character special files and block special files

For most I/O operations, it can be done using only the appropriate files and no special system calls are required. Then, sometimes some device-specific processing is required. Prior to POSIX, most UNIX systems had a system call called IOCTL, which was used to perform a large number of system calls. Over time, POSIX has cleaned this up, dividing iocTL functionality into separate function calls to end-devices, which are now separate system calls.

Here are a few system calls for the administration terminal

The system calls describe
tcgetattr Retrieve attributes
tcsetattr Set properties
cfgetispeed Get input rate
cfgetospeed Get output rate
cfsetispeed Setting the input rate
cfsetospeed Setting the output rate

Linux IO implementation

IO in Linux is implemented through a series of device drivers, one for each device type. Device drivers reserve ports for the operating system and hardware to mask the differences between the operating system and hardware.

When a user accesses a special file, the file system provides the primary device number and secondary device number of the special file and determines whether it is a block special file or a character special file. The primary device number is used to identify the character device or block device, and the secondary device number is used for parameter passing.

Each driver has two parts: both belong to the Linux kernel and both run in kernel mode. The top half runs in the caller context and interacts with the rest of Linux. The second half runs in the kernel context and interacts with the device. The driver can call kernel procedures such as memory allocation, timer management, DMA control, etc. The kernel functions that can be called are documented in the driver-kernel interface.

I/O implementations refer to the implementation of character devices and block devices

Block device implementation

The goal for the block-specific file I/O part of the system is to keep the number of transfers as small as possible. To achieve this, The Linux system sets up a cache between the disk driver and the file system, as shown in the figure below

Prior to Linux kernel 2.2, Linux systems maintained two caches: the Page cache and the buffer cache, so files stored in one disk block might be in both caches. Since version 2.2, the Linux kernel has only one uniform cache — a Generic Block Layer — that fuses these together, enabling the necessary conversions between disks, blocks, buffers, and data pages. So what is the generic data block layer?

The generic data block layer is the component of a kernel that processes requests to all block devices in the system. Generic data blocks have the following functions

By placing the data buffer high in memory, the pages are mapped to the kernel linear address only when the CPU accesses the data and are unmapped thereafter

Implementing a zero-copy mechanism allows disk data to be placed directly into the user-mode address space without having to be first copied to kernel memory

Managing disk volumes treats multiple disk partitions on different block devices as one partition.

Take advantage of the advanced features of the latest disk controllers, such as DMA, etc.

The cache is a powerful tool to improve system performance. No matter what purpose a data block is needed, the system searches the cache first. If the data block is found, the system returns the data block directly, avoiding a single disk access and greatly improving system performance.

If this block is not in the page cache, the operating system will pull the page from disk into memory and then read it into the cache for caching.

Cache supports both read and write operations. To write back a block, a program writes it to the cache first, rather than directly to disk, and then writes it to the cache when the disk cache reaches a certain value.

Linux systems use AN IO scheduler to ensure that there is less repetitive head movement and thus less loss. The I/O scheduler is used to sort read and write operations on block devices and to merge read and write requests. Linux has many variants of the scheduler to meet different job needs. The most basic Linux scheduler is based on the traditional Linux Elevator Scheduler. The main workflow of the Linux elevator scheduler is to sort disk sectors by address and store them in a two-way linked list. New requests will be inserted as a linked list. This method can effectively prevent the head from moving repeatedly. Because elevator schedulers are prone to starvation. Therefore, Linux has modified the original, maintaining two linked lists, and maintaining sorted read and write operations within the deadline. The default read operation takes 0.5s, and the default write operation takes 5s. If the list that has waited the longest by the deadline does not receive the service, it will receive the service first.

Character device implementation

Interaction with character devices is relatively simple. Since character devices generate and use character stream, byte data, support for random access makes little sense. One exception is the use of Line disciplines. A code can be associated with a terminal device, expressed using the TTY_struct structure, which represents the interpreter that exchanges data with the terminal device, but is also part of the kernel. For example: line rules can be edited line, mapping carriage return to line feed and a series of other operations.

What is a line rule?

Industry rules are a layer in some UniX-like systems. Terminal subsystems usually consist of three layers: an upper layer that provides character device interfaces, a lower layer of hardware drivers that interact with hardware or pseudo terminals, and a middle layer of rules that implement common behavior among terminal devices.

Network equipment implementation

Network device interactions are not the same, although network devices also generate character streams because their asynchronous nature makes it difficult to integrate with other character devices on the same interface. The network device driver generates a lot of data packets that are sent to the user application over the network protocol.

Modules in Linux

UNIX device drivers are statically loaded into the kernel. Therefore, device drivers are loaded into memory as soon as the system starts. With the advent of Linux for personal computers, this pattern of static links being used for a while was broken. The number of I/O devices available on PCS has increased by an order of magnitude compared to the number of I/O devices available on minicomputers. Most users don’t have the ability to add a new application, update device drivers, reconnect the kernel, and install.

Linux solves this problem by introducing a loadable Module mechanism. Loadable is a block of code added to the kernel while the system is running.

When a module is loaded into the kernel, several things happen: First, during the loading process, the module is dynamically redeployed. Second, the system checks that the resources required by the program are available. If available, these resources are marked as in use. The third step is to set the desired interrupt vector. Fourth, update the driver conversion table to handle the new master device type. Finally, run the device driver.

When this is done, the driver is installed, and other modern UNIX systems also support loadable mechanisms.

Pay attention to the public number programmer CXUAN reply cXUAN get quality data.

I have written six PDFS of my own, very hardcore, which are linked below

I have written six PDFS of my own, very hardcore, which are linked below

I have written six PDFS of my own, very hardcore, which are linked below

Cxuan wrote four PDFS.

Cxuan added two PDFS.