A larger version of the overview diagram appears at the end of the article.

This is not an article that teaches you how to create an operating system. Rather, it is a guide to understanding operating systems from several aspects. First you need to know why you are reading this article and why you need to learn about operating systems.

Get a few things straight

The first thing to settle is why you are learning about operating systems at all. How important is the operating system? What will learning about it do for you? I will answer these questions below.

An operating system is also software, but it is a very complicated piece of software. Operating systems provide several abstract models:

  • File: An abstraction of an I/O device
  • Virtual memory: Abstraction of program memory
  • Process: An abstraction of a running program
  • Virtual machine: Abstraction of the entire operating system

These abstractions are relevant to our daily development. Understanding how the operating system builds its abstractions cultivates our own abstract thinking and design ideas.

Many everyday engineering problems are related to the operating system, and the operating system is the basis for solving them. Without that knowledge you might try to solve them at the framework level, without understanding what is really happening underneath. Learning the operating system helps you develop a global view of the whole system.

Learning the operating system also lets us deal effectively with concurrency. Concurrency is close to the top priority for Internet services, which in turn shows how important the operating system is.

The point of learning operating systems is not to make an operating system from scratch, but to show you how it works. It will give you an understanding of the underlying computer and give you a solid foundation.

I’m sure you know what programming is

Data structures + Algorithms = Programming

The internals of an operating system involve many data structures and algorithms; studying them gives you a solid algorithmic foundation and helps you write better programs.

I think the computer can be compared to a building

The hardware at the bottom is the foundation of the building, applications are its outward shape, the operating system tells you how the building is constructed, and writing high-quality software is learning how to build a stable house.

Understanding the operating system

Before you get to know operating systems, you need to know what a computer system is: a modern computer system consists of one or more processors, main memory, printers, keyboards, mice, monitors, network interfaces, and various input/output devices. These are all hardware categories. We programmers don’t work directly with this hardware, and it’s unlikely that every programmer will know the ins and outs of every computer system.

So computer scientists installed a layer of software on top of the hardware. This software controls the hardware according to the commands the user issues and satisfies the user's needs; it is called the operating system, and its task is to provide user programs with a better, simpler, cleaner model of the computer. In other words, the operating system acts as an intermediate layer, providing interfaces between the user layer and the hardware, shielding the differences between applications and hardware, and acting as a unified standard.

At the bottom is the hardware. The hardware includes the chips, circuit boards, disks, keyboards, monitors, and other devices we mentioned above. On top of the hardware is the software. Most computers run in two modes: kernel mode and user mode. The most basic part of software is the operating system, which runs in kernel mode. The operating system has access to the hardware and can execute any instructions the machine can run. The rest of the software runs in user mode.

With an overview of the operating system, let’s take a look at the hardware

Computer hardware

Computer hardware is an important part of the computer and is made up of five important components: the arithmetic unit, the controller, memory, input devices, and output devices.

  • Arithmetic unit: the part of the computer that processes and calculates data and information, performing arithmetic and logical operations. Its basic operations include addition, subtraction, multiplication, division, shifts, and so on, implemented by the arithmetic and logic unit (ALU). The arithmetic unit consists mainly of the ALU and registers.
  • Controller: the part that drives the main circuit or control circuit in the specified order. It controls the execution of instructions and coordinates and directs the operation of the entire computer system. The controller is composed of the program counter, the instruction register, the decoder, and so on.

The arithmetic unit and the controller together make up the CPU

  • Memory: Memory is the memory device of a computer. As the name suggests, memory can hold information. There are two kinds of memory: main memory, or memory, which is the main interaction object of the CPU, and external memory, such as hard disks and floppy disks. The following is the storage architecture of modern computer systems

  • Input device: Input device is the device for the computer to obtain external information, it mainly includes the keyboard and mouse.

  • Output device: Output device is a device that presents the information obtained from the input device to the user after a series of calculations. It mainly includes display, printer, etc.

These five parts are also von Neumann’s architecture, which states that a computer must have the following functions:

  • Send the required programs and data into the computer.
  • Remember programs, data, intermediate results, and final results for the long term.
  • Perform arithmetic and logical operations, data transfers, and other kinds of data processing.
  • Control the flow of the program as needed and coordinate all parts of the machine according to the instructions.
  • Output the processing results to the user as required.

The following is a product map of the Intel family, which gives a detailed classification of computer hardware. We now introduce the hardware shown in the diagram.

  • Buses: running throughout the system are collections of electrical conduits called buses, which carry bytes of information back and forth between components. Buses are usually designed to transfer chunks of a fixed size known as words. The number of bytes in a word (the word length) is a basic system parameter that varies from system to system; most words today are 4 bytes (32 bits) or 8 bytes (64 bits).

  • I/O Devices: Input/Output Devices are the connections between the system and the outside world. In the figure above, there are four types of I/O devices: a keyboard and mouse for user input, a display for user output, and a disk drive for storing data and programs for long periods of time. In the beginning, executable programs are saved on disk.

    Each I/O device connects to the I/O bus through a controller or an adapter. The main difference between the two is packaging: a controller is a chip set on the I/O device itself or on the system's main printed circuit board (usually called the motherboard), while an adapter is a card that plugs into a slot on the motherboard. Either way, their purpose is to move information back and forth between the I/O bus and the device.

  • Main Memory: main memory is a temporary storage device, not permanent storage (disks are the permanent storage devices). Main memory holds both the program and the data the processor works on while executing it. Physically, main memory is a set of dynamic random access memory (DRAM) chips. Logically, memory is a linear array of bytes, each with its own unique address, starting at 0. In general, each machine instruction that makes up a program consists of a varying number of bytes, and the size of the data item corresponding to a C program variable depends on its type. For example, on a Linux x86-64 machine, short takes 2 bytes, int and float take 4 bytes, and long and double take 8 bytes (the short C sketch after this list prints these sizes).

  • The Processor, central processing unit (CPU), or simply processor, is the engine that interprets (and executes) the instructions stored in main memory. At its core is a word-sized storage device (register) called the program counter (PC). At any moment, the PC points at some machine-language instruction in main memory (that is, it contains that instruction's address).

    From the time the system is powered on until it is powered off, the processor repeatedly executes the instruction pointed to by the program counter and then updates the program counter to point to the next instruction. The processor operates according to the instruction model defined by its instruction set architecture. In this model, instructions execute in strict sequence, and executing one instruction involves a series of steps: the processor reads the instruction that the program counter points to in memory, interprets its bits, performs the simple operation the instruction dictates, and then updates the program counter to point to the next instruction, which may or may not follow sequentially in memory (for example, a JMP instruction does not continue sequentially).
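As a quick way to see the type sizes just mentioned, here is a small C sketch (not part of the original text) that prints them; on a Linux x86-64 machine it typically reports 2, 4, 4, 8, and 8 bytes.

```c
#include <stdio.h>

int main(void) {
    /* sizeof reports the size of each type on the machine the code runs on. */
    printf("short:  %zu bytes\n", sizeof(short));
    printf("int:    %zu bytes\n", sizeof(int));
    printf("float:  %zu bytes\n", sizeof(float));
    printf("long:   %zu bytes\n", sizeof(long));
    printf("double: %zu bytes\n", sizeof(double));
    return 0;
}
```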

Here are a few of the simple operations the CPU may perform at the request of an instruction:

  • Load: copies a byte or word from main memory into a register, overwriting the previous contents of the register
  • Store: copies a byte or word from a register to a location in main memory, overwriting the previous contents of that location
  • Operate: copies the contents of two registers to the ALU (arithmetic logic unit), performs an arithmetic operation on the two words, and stores the result in a register, overwriting that register's previous contents

An arithmetic logic unit (ALU) is a combinational digital circuit that performs arithmetic and bitwise operations on binary numbers.

  • Jump: extracts a word from the instruction itself and copies it into the program counter (PC), overwriting the PC's previous value

Processes and threads

For processes and threads, you need to understand the key points in this mind map

process

The core concept of an operating system is the process, which is an abstraction of a running program. Everything else in the operating system revolves around processes.

In a multiprogramming system, the CPU switches quickly between processes, letting each one run for tens or hundreds of milliseconds. Strictly speaking, the CPU runs only one process at any instant, but over the course of, say, one second it may run many of them, which gives the illusion of parallelism. Because CPUs are fast and processes switch in and out quickly, it is hard to keep track of multiple parallel activities. Operating system designers therefore developed a conceptual model, the sequential process, that makes this pseudo-parallelism easier to understand and analyze.

Process model

A process is an instance of an executing program. A process also contains the current values of program counters, registers, and variables. Conceptually, each process has its own virtual CPU, but in practice the CPU switches back and forth between processes.

As shown in the figure above, this is a multi-processing program with four programs, and the program counter changes in different ways as the process changes.

In the figure above, the four programs are abstracted as four processes, each with its own flow of control (that is, its own logical program counter), and each runs independently. Of course, there is only one physical program counter; when a process is given the CPU, its logical program counter is loaded into the physical program counter, and when its turn ends, the current value of the physical program counter is saved back into the process's logical program counter.

As you can see from the figure below, after a long enough period of observation, all processes are running, but only one process is actually running at any given moment.

Thus, when we say that a CPU can really run only one process at a time, the same holds per core: even with 2 cores (or CPUs), each core can run only one thread at a time.

Since the CPU switches quickly back and forth between processes, the amount of time each process runs in the CPU is uncertain. And when the same process runs in the CPU again, its running time inside the CPU is often not fixed.

The key idea here is to recognize what a process needs: a process is the sum total of a particular activity, comprising its program, input, output, and state.

Process creation

The operating system needs some way to create processes. Here are some ways to create a process

  • System initialization: When starting the operating system, several processes are typically created.
  • A running program performs a system call that creates the process (such as fork)
  • A user requests the creation of a new process: on many interactive systems, a program can be launched by typing a command or double-clicking an icon, and either action starts a new process. On a command-based UNIX system running X, the new process takes over the window in which it was started.
  • Initiation of a batch job

Technically, in all of these cases a new process is created by having an existing process execute a process-creation system call. That existing process may be a running user process, a system process invoked from the keyboard or mouse, or a batch-manager process. In every case it executes a system call that tells the operating system to create a new process and indicates, directly or indirectly, which program to run in it.

In UNIX, there is only one system call to create a new process: fork. This call creates a clone of the calling process. After the fork, the parent and the child have the same memory image, the same environment strings, and the same open files.
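As a minimal sketch of how fork behaves (illustrative, not taken from the article): parent and child continue from the same point with copies of the same memory image, distinguished only by fork's return value.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();              /* clone the calling process */
    if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
    } else if (pid == 0) {
        /* child: same memory image, environment strings and open files as the parent */
        printf("child:  pid=%d\n", getpid());
        _exit(0);
    } else {
        /* parent: fork returned the child's pid */
        printf("parent: pid=%d, child=%d\n", getpid(), pid);
        wait(NULL);                  /* wait for the child to terminate */
    }
    return 0;
}
```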

In Windows, on the other hand, a single Win32 function, CreateProcess, handles both process creation and loading the correct program into the new process. This call takes 10 parameters, including the program to be executed, the command-line parameters to feed that program, various security attributes, bits that control whether open files are inherited, priority information, a specification of the window to be created for the process (if any), and a pointer to a structure in which information about the newly created process is returned to the caller. In Windows, the parent's address space and the child's address space are different from the very beginning.

Termination of a process

After a process is created, it starts running and does its work. However, nothing lasts forever, not even processes. Sooner or later a process terminates, usually for one of the following reasons:

  • Normal exit (voluntary): most processes terminate because they have finished their work. When a compiler has finished compiling the program given to it, it makes a system call to tell the operating system that it is done. This call is exit in UNIX and ExitProcess in Windows.
  • Error exit (voluntary): for example, if you execute a command that does not exist, the interpreter reports the error and exits.
  • Fatal error (involuntary)
  • Killed by another process (involuntary): a process makes a system call telling the operating system to kill some other process. In UNIX the system call is kill; the corresponding Win32 function is TerminateProcess (which, note, is not a system call).

The hierarchy of processes

In some systems, when a process creates other processes, the parent and child processes are related in some way. The child process itself creates more processes, forming a process hierarchy.

UNIX process system

In UNIX, a process and all its children and their children form a group of processes. When a user sends a signal from the keyboard, the signal is sent to all members of the group of processes currently associated with the keyboard (they are usually all active processes created in the current window). Each process can separately capture the signal, ignore it, or take the default action of being killed by the signal. All processes in the entire operating system belong to a single tree of processes rooted in init.

Windows Process System

In contrast, there is no concept of a process hierarchy in Windows: all processes are equal. The only thing resembling a hierarchy is that when a process is created, the parent receives a special token (called a handle) that it can use to control the child. However, this token may be passed on to some other process, so there is no real hierarchy. In UNIX, a process cannot disinherit its children. (Still, Windows sucks.)

Process status

Although each process is an independent entity with its own program counter and internal state, processes often need to interact with one another. After a process starts running, it may pass through the following states

There are three states involved in the diagram

  1. Running: the process is actually using the CPU during that time slice
  2. Ready: the process is runnable but temporarily stopped because another process is running
  3. Blocked: the process cannot run until some external event occurs

Process implementation

To switch between processes, the operating system maintains a table called the Process Table. Each process occupies one process table entry. This entry contains important information about the state of the process, including the program counter, stack pointer, memory allocation, status of open files, accounting information, and scheduling information, as well as everything else that must be saved when the process switches from running to ready or blocked.

The key fields in a typical system are shown below

The first column deals with process management, the second with storage management, and the third with file management.
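As a rough illustration of what such an entry might contain, here is a hypothetical C structure; the field names are invented for this sketch and do not come from any particular kernel.

```c
/* A simplified, hypothetical process-table entry.
   Real kernels (e.g. Linux's task_struct) hold far more state. */
struct process_entry {
    int   pid;                 /* process identifier                       */
    int   state;               /* running, ready or blocked                */
    void *program_counter;     /* saved program counter when not running   */
    void *stack_pointer;       /* saved stack pointer                      */
    long  registers[16];       /* saved general-purpose registers          */
    void *page_table;          /* memory management: where the memory is   */
    int   open_files[32];      /* file management: status of open files    */
    int   priority;            /* scheduling information                   */
    long  cpu_time_used;       /* accounting information                   */
};
```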

Now that we have a general understanding of the process table, we can explain more about the illusion of how multiple sequential processes run on a single CPU. Associated with each I/O class is a location called an interrupt vector (a fixed area near the bottom of memory). It contains the entry address of the interrupt service routine. Assuming that user process 3 is running when a disk interrupt occurs, the interrupt hardware pushes the program counter, the program status word, and sometimes one or more registers onto the stack, and the computer jumps to the address indicated by the interrupt vector. That’s what hardware does. Software then takes over the rest of the work.

When the interrupt ends, the operating system calls a C program to handle the rest of the interrupt. After the rest of the work is done, some processes are ready and the scheduler is called to decide which process to run next. Control is then transferred to a piece of assembly language code that loads register values and memory maps for the current process and starts the process running, as shown below for interrupt handling and scheduling.

  1. The hardware pushes the program counter and other state onto the stack

  2. Hardware loads the new program counter from the interrupt vector

  3. The assembly language procedure saves the registers

  4. Assembly language procedure sets up the new stack

  5. The C interrupt service routine runs (typically it reads and buffers input)

  6. The scheduler decides which process to run next

  7. C procedure returns to assembly code

  8. The assembly language procedure starts running the new current process

A process may be interrupted thousands of times during execution, but the key is that after each interruption, the interrupted process returns to exactly the same state as before the interruption occurred.

thread

In traditional operating systems, each process has an address space and a single thread of control; in fact, that is almost the definition of a process. Nevertheless, it is common to have multiple threads of control running in the same address space, acting almost as if they were separate processes. Let us focus on what a thread is.

Use of threads

Perhaps you have asked this question yourself: why build the concept of a thread on top of the process? Strictly speaking, this is a comparison of the process model with the thread model, and the answer comes in three parts:

  • Multiple threads can share one address space and all of its data, an ability that processes do not have
  • Threads are more lightweight than processes. Because they are lighter, they are easier to create and to destroy than processes; on many systems, creating a thread is 10-100 times faster than creating a process.
  • The third reason is performance. If the threads are all CPU-bound there is no gain, but when there is substantial computing as well as substantial I/O, threads allow these activities to overlap and thereby speed up the application

The classic threading model

A process has a thread of execution, usually shortened to just thread. The thread has a program counter that keeps track of which instruction to execute next; the thread is the entity that is actually scheduled to execute on the CPU.

In the figure below we can see three traditional processes, each with its own address space and a single thread of control. Each thread runs in a different address space

In the figure below, we can see that there is a process with three threads. Each thread runs in the same address space.

Threads are not as independent as processes. All threads in the same process share exactly the same address space, which means they also share the same global variables. Because each thread can access every memory address within the process's address space, one thread can read, write, and even erase another thread's stack. Besides the shared memory space, the per-process and per-thread items differ, as shown in the figure below.

The items on the left are shared by all threads in a process; the items on the right are private to each thread. In other words, the left-hand list shows per-process properties and the right-hand list shows per-thread properties.

State transitions between threads are the same as between processes.

Each thread has its own stack, as shown in the figure below

Thread system calls

A process typically starts with a single thread, which then creates new threads by calling a library function such as thread_create. A parameter of the thread-creation function specifies the procedure that the new thread should run, and the call usually returns a thread identifier, which serves as the new thread's name.

When a thread has finished its work, it can exit by calling a function, such as thread_exit. Then the thread disappears, the state becomes terminated, and no more scheduling can be performed. During the execution of some threads, one thread can wait for another thread to exit by calling functions such as thread_join. This procedure blocks the calling thread until it waits for a particular thread to exit. In this case, the creation and termination of a thread is very similar to the creation and termination of a process.

Another common call is thread_yield, which allows a thread to voluntarily give up the CPU so that another thread can run. Such a call is important because, unlike processes, threads cannot rely on a clock interrupt to force them to give up the CPU.

POSIX threads

POSIX threads, commonly called PThreads, are a language-independent execution model, as well as a parallel execution model.

It allows programs to control multiple different workflows that overlap in time. Each workflow is called a thread and can be created and controlled by calling the POSIX Threads API. You can think of it as a standard for threads.

POSIX Threads implementations are available on many POSIX-conformant operating systems, such as FreeBSD, NetBSD, OpenBSD, Linux, macOS, Android, and Solaris; on Windows, Pthreads can be implemented on top of the existing Windows APIs.

IEEE is the world’s largest technical professional organization dedicated to the development of technology for the benefit of mankind.

The main thread calls and what they do:

  • pthread_create: create a new thread
  • pthread_exit: terminate the calling thread
  • pthread_join: wait for a specific thread to exit
  • pthread_yield: release the CPU to let another thread run
  • pthread_attr_init: create and initialize a thread's attribute structure
  • pthread_attr_destroy: delete a thread's attribute structure

Every Pthreads thread has an identifier, a set of registers (including the program counter), and a set of attributes stored in a structure. The attributes include the stack size, scheduling parameters, and other items the thread needs.
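A minimal Pthreads sketch using the calls listed above (assumption: a POSIX system; compile with -lpthread):

```c
#include <pthread.h>
#include <stdio.h>

/* Procedure run by each newly created thread. */
static void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    pthread_exit(NULL);                     /* terminate the calling thread */
}

int main(void) {
    pthread_t tid[4];
    int ids[4];

    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        if (pthread_create(&tid[i], NULL, worker, &ids[i]) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
    }
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);         /* wait for each thread to exit */

    return 0;
}
```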

Thread implementation

There are three main implementations

  • Implementing threads in user space;
  • Implementing threads in kernel space;
  • A hybrid implementation in both user space and kernel space.

Let’s discuss it separately

Implement threads in user space

The first way is to put the entire thread package in user space. The kernel knows nothing about threads; it doesn’t know they exist. All such implementations have the same common structure

Threads run on top of a run-time system, which is a collection of procedures that manage threads, including the four mentioned earlier: pthread_create, pthread_exit, pthread_join, and pthread_yield.

Implement threads in the kernel

When a thread wants to create a new thread or destroy an existing thread, it makes a system call that updates the thread table to complete the creation or destruction of the thread.

The thread table in the kernel holds registers, state, and other information for each thread. This information is the same as thread information in user space, but is placed in the kernel instead of user space. In addition, the kernel maintains a process table to keep track of system state.

All calls that can block are implemented as system calls, and when a thread blocks, the kernel can choose between running another thread in the same process (if there are any ready threads) or running another thread in another process. But in a user implementation, the runtime system runs its own threads until the kernel deprives it of CPU time slices (or there are no runnable threads left).

Hybrid implementation

Combining the advantages of user space and kernel space, designers use kernel-level threads and then multiplex user-level threads onto some or all of the kernel threads

In this model, the programmer is free to control the number of user threads and kernel threads, with great flexibility. With this approach, the kernel identifies only kernel-level threads and schedules them. Some of these kernel-level threads are multiplexed by multiple user-level threads.

Interprocess communication

Processes need to communicate frequently with other processes. We will discuss Inter Process Communication (IPC) together. Generally speaking, there are six types of communication mechanisms between processes

Let’s outline each of them below

Signals

Signals were the first interprocess communication mechanism used by UNIX systems, and because Linux inherits from UNIX, it supports signals as well. Signaling works by delivering asynchronous event notifications to one or more processes. A signal can be generated from the keyboard or by an error such as accessing a non-existent memory location, and the shell uses signals to control its child processes.

You can type kill -l on your Linux system to list the signals it uses. Here are a few of them

A process can choose to ignore incoming signals, with two exceptions that can be neither ignored nor caught: SIGSTOP and SIGKILL. The SIGSTOP signal suspends (stops) the currently running process, and the SIGKILL signal tells the current process that it is to be killed. Beyond that, a process can choose which signals it wants to handle itself, which signals to block, and which to leave to the kernel; signals left to the kernel receive the default handling.

To deliver a signal, the operating system interrupts the target process. Execution can be interrupted at any non-atomic instruction; if the process has registered a signal handler, the handler is run, and if not, the default handling is applied.
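A small sketch of registering a handler with sigaction (illustrative): SIGINT is caught here, whereas SIGKILL and SIGSTOP could not be caught even if we tried.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

/* Keep the handler async-signal-safe: just record that the signal arrived. */
static void on_sigint(int signo) {
    (void)signo;
    got_signal = 1;
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_sigint;
    sigaction(SIGINT, &sa, NULL);   /* catch Ctrl-C; SIGKILL/SIGSTOP cannot be caught */

    printf("waiting for SIGINT (press Ctrl-C)...\n");
    while (!got_signal)
        pause();                    /* sleep until a signal is delivered */

    printf("caught SIGINT, exiting\n");
    return 0;
}
```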

Pipes

Processes on Linux systems can communicate by setting up pipes

A channel can be established between two processes; one process writes a stream of bytes into the channel and the other reads the stream out of it. Pipes are synchronous: when a process tries to read from an empty pipe, it blocks until data is available. Consider, for example, the shell command

sort <f | head

It creates two processes, sort and head, and sets up a pipe between the two applications so that the standard output of the sort process becomes the standard input of the head program. The output generated by sort never needs to be written to a file, and if the pipe fills up, the system suspends sort until head has read some data

The pipe is the | in the command line; the two applications are unaware of the pipe's existence, and everything is managed and controlled by the shell.
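A rough sketch of what the shell does for sort <f | head: create a pipe, fork twice, and wire one child's standard output to the other child's standard input (the program names are hard-coded here purely for illustration).

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); exit(1); }

    if (fork() == 0) {                  /* first child: the writer, like sort */
        dup2(fd[1], STDOUT_FILENO);     /* its stdout now goes into the pipe */
        close(fd[0]); close(fd[1]);
        execlp("sort", "sort", "f", (char *)NULL);
        perror("execlp sort"); _exit(1);
    }
    if (fork() == 0) {                  /* second child: the reader, like head */
        dup2(fd[0], STDIN_FILENO);      /* its stdin now comes from the pipe */
        close(fd[0]); close(fd[1]);
        execlp("head", "head", (char *)NULL);
        perror("execlp head"); _exit(1);
    }
    close(fd[0]); close(fd[1]);         /* parent closes both ends and waits */
    wait(NULL); wait(NULL);
    return 0;
}
```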

Shared memory

Two processes can also communicate with each other through shared memory, where two or more processes can access the common memory space. The shared work of two processes is done through shared memory, and changes made by one process are visible to the other (much like communication between threads).

Before using shared memory, you need to go through the following sequence of calls (a small sketch in C follows the list)

  • Create a shared memory segment, or use an existing one (shmget())
  • Attach the process to the created memory segment (shmat())
  • Detach the process from the attached shared memory segment (shmdt())
  • Perform control operations on the shared memory segment (shmctl())
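A sketch of the four System V calls above (the key 0x1234 and the 4 KB size are arbitrary illustrative values):

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    /* 1. Create (or reuse) a 4 KB shared memory segment. */
    int shmid = shmget((key_t)0x1234, 4096, IPC_CREAT | 0666);
    if (shmid == -1) { perror("shmget"); return 1; }

    /* 2. Attach the segment into this process's address space. */
    char *mem = shmat(shmid, NULL, 0);
    if (mem == (char *)-1) { perror("shmat"); return 1; }

    /* Anything written here is visible to any other process that attaches. */
    strcpy(mem, "hello from shared memory");

    /* 3. Detach from the segment and 4. mark it for removal. */
    shmdt(mem);
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}
```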

First-in, first-out queue FIFO

First-in, first-out queues (FIFOs) are often referred to as Named Pipes. Named pipes work much like regular pipes but do have some obvious differences. Unnamed pipes have no backing file: the operating system maintains an in-memory buffer used to transfer bytes from the writer to the reader, and once the writer or reader terminates, the buffer is reclaimed and any untransferred data is lost. Named pipes, by contrast, have backing files and their own API; they exist in the file system as special device files. When all process communication is finished, the named pipe remains in the file system for later use. Named pipes have strict FIFO behavior

The first byte written is the first byte read, the second byte written is the second byte read, and so on.

Message Queue

You may not immediately know what a message queue is from its name. A message queue is a linked list maintained inside the kernel's address space. Messages can be sent to the queue in order and retrieved from the queue in several different ways, and each message queue is uniquely identified by an IPC identifier. Message queues come in two modes: strict mode, which behaves like a FIFO queue where messages are sent in order and read in order, and a non-strict mode, where the order of the messages is not important.

Sockets

Another way for two processes to communicate is to use sockets, which provide end-to-end, two-way communication. A socket can be associated with one or more processes. Just as there are named and unnamed pipes, sockets come in several modes. Sockets are generally used for network communication between two processes, and network sockets rely on underlying protocols such as TCP (Transmission Control Protocol) or the lower-level UDP (User Datagram Protocol).

Sockets are classified as follows (a minimal usage sketch follows the list)

  • Sequential Packet Socket: This type of socket provides reliable connections for datagrams of fixed maximum length. The connection is bidirectional and sequential.
  • Datagram Socket: Packet socket supports bidirectional data streaming. Packet sockets may receive messages in a different order than the sender.
  • Stream Socket: Streaming sockets work like a telephone conversation, providing a reliable two-way flow of data.
  • Raw Socket: Basic communication protocols can be accessed using raw sockets.
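As a minimal sketch, two related processes can talk over a UNIX-domain stream socket pair created with socketpair; network sockets using TCP or UDP follow the same read/write pattern once connected.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int sv[2];
    /* A connected pair of UNIX-domain stream sockets. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) { perror("socketpair"); return 1; }

    if (fork() == 0) {                         /* child writes on sv[1] */
        close(sv[0]);
        const char *msg = "hello over a socket";
        write(sv[1], msg, strlen(msg) + 1);
        close(sv[1]);
        _exit(0);
    }

    close(sv[1]);                              /* parent reads on sv[0] */
    char buf[64];
    if (read(sv[0], buf, sizeof(buf)) > 0)
        printf("parent received: %s\n", buf);
    close(sv[0]);
    wait(NULL);
    return 0;
}
```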

scheduling

When a computer is a multi-programming system, there are frequently many processes or threads competing for CPU slices at the same time. This happens when two or more processes/threads are in a ready state. If only one CPU is available, then you must choose which process/thread can run next. There is a role in the operating system called a scheduler that does just that, using an algorithm called the scheduling algorithm.

Classification of scheduling algorithms

There is no doubt that different scheduling algorithms are needed in different environments. This happens because different applications and different operating systems have different goals. In other words, scheduler optimization is different in different systems. It is necessary to distinguish three environments

  • Batch: commercial and business data processing
  • Interactive: interactive user environments
  • Real time: real-time systems

Scheduling in batch processing

Now let’s switch our focus from general scheduling to specific scheduling algorithms. Next we will look at scheduling in batch processing.

First come, first served

The simplest non-preemptive scheduling algorithm is first-come, first-served. When the first job enters the system from outside, it is started immediately and allowed to run as long as it wants; it will not be interrupted for having run too long. As other jobs come in, they are placed at the end of the ready queue. When the running process blocks, the first process in the queue runs next. When a blocked process becomes ready again, it is placed at the end of the queue like a newly arrived job, behind all the waiting processes.

The strength of this algorithm is that it is easy to understand and easy to program. A single linked list keeps track of all ready processes. To pick a process to run, simply remove one from the head of the queue; to add a new job or a newly unblocked process, simply append it to the end of the queue. It could hardly be simpler to implement.

Shortest job first

In batch processing, the second scheduling algorithm is Shortest Job First, which assumes the running times are known in advance. An insurance company, for example, can predict with considerable accuracy how long it will take to process a batch of 1,000 claims, because it does similar work every day. When several equally important jobs are sitting in the input queue, the scheduler should use the shortest job first algorithm

It is important to note that shortest job first is only optimal when all the jobs are available at the same time.

Shortest remaining time next

The Shortest job first preemptive version is known as the Shortest Remaining Time Next algorithm. With this algorithm, the scheduler always selects the process with the shortest remaining running time to run.

Scheduling in interactive systems

Interactive systems are so common on PCs, servers, and other machines that it is worth exploring interactive scheduling

Round-robin scheduling

One of the oldest, simplest, fairest, and most widely used algorithms is round robin. Each process is assigned a time interval, called a quantum, during which it is allowed to run. If the process is still running when its quantum expires, the CPU is preempted and given to another process. If the process blocks or finishes before the quantum expires, the CPU switches immediately. Round robin is easy to implement: all the scheduler has to do is maintain a list of runnable processes, as in part (a) of the figure below; when a process uses up its quantum, it is moved to the end of the list, as in part (b).

Priority scheduling

Round-robin scheduling assumes that all processes are equally important. But that may not be the case. For example, in a university hierarchy, first comes the dean, then the professors, secretaries, and support staff, and finally the students. Taking such external considerations into account leads to priority scheduling

The basic idea is clear: each process is assigned a priority, and the process with the highest priority runs first.

Multi-level queue

The earliest system to use priority scheduling was CTSS (Compatible Time-Sharing System). CTSS had to swap the current process out to disk and read a new one in from disk before each switch. For CPU-intensive processes it is more efficient to give them long time slices occasionally than to give them short time slices frequently (this reduces the number of swaps). On the other hand, as mentioned earlier, long time slices hurt response time. The solution is to set up priority classes: processes in the highest class run for one time slice, processes in the next class run for two time slices, processes in the class after that run for four time slices, and so on. Whenever a process uses up its allotted time slices, it is moved down to the next class.

Shortest process first

Shortest process next runs the process with the shortest estimated remaining running time, where the estimate is based on the process's past behavior. Suppose the estimated running time for some command on a terminal is T0, and its next run is measured to take T1. The estimate can be updated by taking a weighted sum of these two values, namely aT0 + (1 - a)T1. By choosing a, we can decide whether to forget old run times quickly or to remember them for a long time. With a = 1/2, we get the following sequence of estimates

T0,   T0/2 + T1/2,   T0/4 + T1/4 + T2/2,   T0/8 + T1/8 + T2/4 + T3/2

As you can see, after three rounds, T0’s share of the new estimate has dropped to 1/8.

Guaranteed scheduling

A completely different scheduling approach is to make explicit performance guarantees to users. A practical and easily implemented guarantee is that if a user is working with n users logged in, each user will get 1/n of the CPU processing power. Similarly, in a single-user system with n processes running, if all processes are equivalent, each process will get 1/n of CPU time.

Lottery scheduling

Making promises to users and then living up to them is a fine idea, but hard to do. There is, however, an algorithm that both gives predictable results and is relatively simple to implement: lottery scheduling.

The basic idea is to give processes lottery tickets for various system resources, such as CPU time. Whenever a scheduling decision has to be made, a ticket is drawn at random, and the process holding that ticket gets the resource. Applied to CPU scheduling, the system might hold a lottery 50 times a second, with each winner receiving, say, 20 milliseconds of CPU time as a prize.

Fair share scheduling

So far we have assumed that each process is scheduled on its own, without regard to who owns it. As a result, if user 1 starts nine processes and user 2 starts one, then with round robin or equal priorities, user 1 will get 90% of the CPU time and user 2 only 10%.

To prevent this, some systems take a process's owner into account before scheduling it. In this model, each user is allocated some fraction of the CPU time, and the scheduler picks processes so as to enforce that allocation. Thus if two users have each been promised 50% of the CPU, each will get that share no matter how many processes they have.

Scheduling in real-time systems

A real-time system is one in which time plays an essential role. Real-time systems fall into two categories: hard real-time systems, in which absolute deadlines must be met, and soft real-time systems, in which missing an occasional deadline is undesirable but nevertheless tolerable.

Events in a real-time system can be further classified by how the system must respond to them: periodic events occur at regular intervals, and aperiodic events occur at unpredictable times. A system may have to respond to multiple periodic event streams, and depending on how much time each event requires, it may not even be able to handle them all. For example, if there are m periodic events, event i occurs with period Pi and requires Ci seconds of CPU time to handle, then the load can only be handled if

C1/P1 + C2/P2 + … + Cm/Pm ≤ 1

Only real-time systems that meet this condition are called schedulable, which means that it can actually be implemented. A process that does not meet this test cannot be scheduled because the total amount of CPU time required by these processes is greater than the CPU can provide.
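The condition can be checked mechanically. Here is a small sketch with made-up periods and service times (periods of 100, 200, and 500 ms and CPU times of 50, 30, and 100 ms respectively):

```c
#include <stdio.h>

/* Schedulable if the total CPU utilization sum(C_i / P_i) is at most 1. */
static int schedulable(const double C[], const double P[], int m) {
    double load = 0.0;
    for (int i = 0; i < m; i++)
        load += C[i] / P[i];
    printf("total CPU load = %.2f\n", load);
    return load <= 1.0;
}

int main(void) {
    double P[] = {100, 200, 500};   /* periods in ms (illustrative)  */
    double C[] = {50, 30, 100};     /* CPU time per event in ms      */
    printf("%s\n", schedulable(C, P, 3) ? "schedulable" : "not schedulable");
    return 0;
}
```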

Here’s what you need to know about memory management

Address space

If you want multiple applications to be in memory at the same time, you must solve two problems: protection and relocation. An early solution was to mark blocks of memory with a protection key and compare the key of the executing process with the key of every memory word it referenced. This solves only the first problem (user programs damaging the operating system); it does not solve the relocation problem that arises when multiple programs run in memory at the same time.

A better approach is to create a storage abstraction: the Address space. Just as the concept of a process creates an abstract CPU to run programs, address Spaces create an abstract memory for programs to use.

Base and limit registers

The simplest approach is dynamic relocation, which maps each process's address space onto a different region of physical memory in a simple way. The classic way to do this is to equip the CPU with two special hardware registers, the base register and the limit register.

  • Base register: holds the physical address at which the process's memory region begins
  • Limit register: holds the length of the program's address space

Every time a process references memory, whether to fetch an instruction or to read or write data, the CPU automatically adds the base register value to the address generated by the process before sending it out on the memory bus. At the same time, it checks whether the address supplied by the program is greater than or equal to the value in the limit register; if the address is out of range, a fault is generated and the access is aborted.
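In real hardware this check happens on every reference; as a software sketch with hypothetical register values:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical relocation: every address the program generates is compared
   with the limit register and then offset by the base register. */
static unsigned long translate(unsigned long vaddr,
                               unsigned long base, unsigned long limit) {
    if (vaddr >= limit) {                 /* outside the process's region */
        fprintf(stderr, "protection fault at 0x%lx\n", vaddr);
        exit(1);
    }
    return base + vaddr;                  /* address actually put on the bus */
}

int main(void) {
    unsigned long base = 0x40000, limit = 0x10000;   /* illustrative values */
    printf("virtual 0x2000 -> physical 0x%lx\n", translate(0x2000, base, limit));
    translate(0x20000, base, limit);                  /* triggers the fault path */
    return 0;
}
```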

Swapping

In the process of running the program, the problem of insufficient memory often occurs.

One option is swapping, which brings a process into memory in its entirety, runs it for a while, and then puts it back on disk. Idle processes are stored on disk, so they take up no memory while they are not running. The other strategy is virtual memory, a technique that allows a program to run even when only part of it is in memory. Let us talk about swapping first.

How swapping works

Here is an example of swapping in operation

At first only process A is in memory. Then processes B and C are created or swapped in from disk. In part (d), A is swapped out of memory to disk, and finally A comes back in. Because process A now ends up at a different location (part (g)), its addresses must be relocated, either by software when it is swapped in or by hardware during program execution. Base and limit registers are well suited to the latter case.

Swapping creates multiple free areas (holes) in memory. It is possible to move all the processes downward as far as possible so that the holes merge into one large free area; this technique is called memory compaction. It is usually not done, however, because it consumes a lot of CPU time.

Free memory management

When memory is allocated dynamically, the operating system must manage it. Roughly speaking, there are two ways to monitor memory usage

  • The bitmap (bitmap)
  • Free Lists

Storage management using bitmaps

With the bitmap approach, memory may be divided into allocation units as small as a few words or as large as a few thousand bytes. Each allocation unit corresponds to a bit in the bitmap, with 0 indicating free and 1 indicating occupied (or vice versa). A memory region and its corresponding bitmap are shown below

Bitmaps provide a simple way to keep track of memory in a fixed amount of space, because the size of the bitmap depends only on the size of memory and the size of the allocation unit. The problem with this approach is that when it has been decided to bring a process of k allocation units into memory, the memory manager must search the bitmap for a run of k consecutive 0 bits. Searching a bitmap for a run of zeros of a given length is a slow operation. (Think of it as finding a run of free slots in an unordered array.)
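A sketch of that costly step: scanning for k consecutive free units (one char per allocation unit here, purely to keep the sketch short; a real bitmap packs 8 units per byte).

```c
#include <stdio.h>

/* Return the index of the first run of k consecutive 0s, or -1 if none. */
static int find_free_run(const char *bitmap, int n, int k) {
    int run = 0;
    for (int i = 0; i < n; i++) {
        run = (bitmap[i] == 0) ? run + 1 : 0;   /* extend or restart the run */
        if (run == k)
            return i - k + 1;                   /* start of the run found */
    }
    return -1;
}

int main(void) {
    char bitmap[] = {1, 1, 0, 1, 0, 0, 0, 1, 0, 0};  /* 1 = in use, 0 = free */
    printf("3 free units start at index %d\n", find_free_run(bitmap, 10, 3));
    return 0;
}
```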

Use linked lists for management

Another way to keep track of memory is to maintain a linked list of allocated and free memory segments, where each segment is either a process or a hole between two processes. The memory of part (c) of the figure above can be represented this way. Each entry in the list specifies whether it is a hole (H) or a process (P), the address at which it starts, its length, and a pointer to the next entry.

When the processes and holes are kept on a list sorted by address, several algorithms can be used to allocate memory for a newly created process (or an existing process being swapped in from disk). Assuming the memory manager knows how much memory to allocate, the simplest algorithm is first fit: the memory manager scans along the list of segments until it finds a hole that is big enough. Unless the hole is exactly the size of the space to be allocated, it is split into two parts, one for the process and one that becomes a new, smaller hole. First fit is a fast algorithm because it searches as little of the linked list as possible.

A small variation of the first fit is next fit. It works in the same way as the first match, except that the next match records its location each time it finds a suitable free area, so that the next time it finds a free area it starts where it left off, rather than starting from the beginning like the first match algorithm does.

Another well-known and widely used algorithm is Best Fit. The best fit is to go through the list and find the smallest free area that can accommodate the process.
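As a sketch of first fit over such a list (simplified: holes only, no removal of empty nodes and no coalescing of neighbors):

```c
#include <stdio.h>

/* One hole in the free list: `size` units starting at address `start`. */
struct hole {
    unsigned long start, size;
    struct hole *next;
};

/* First fit: carve the request out of the first hole that is big enough.
   Returns the allocated start address, or (unsigned long)-1 on failure. */
static unsigned long first_fit(struct hole *list, unsigned long request) {
    for (struct hole *h = list; h != NULL; h = h->next) {
        if (h->size >= request) {
            unsigned long addr = h->start;
            h->start += request;      /* shrink the hole; a real allocator   */
            h->size  -= request;      /* would also unlink holes that empty  */
            return addr;
        }
    }
    return (unsigned long)-1;         /* no hole large enough */
}

int main(void) {
    struct hole h2 = {200, 50, NULL};
    struct hole h1 = {0, 30, &h2};    /* two holes: [0,30) and [200,250) */
    printf("alloc 40 -> %lu\n", first_fit(&h1, 40));   /* skips the 30-unit hole */
    printf("alloc 20 -> %lu\n", first_fit(&h1, 20));   /* fits in the first hole */
    return 0;
}
```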

Virtual memory

Although the base and address registers are used to create an abstraction of the address space, there is another problem that needs to be addressed: managing bloatware. The basic idea of virtual memory is that each program has its own address space, which is divided into blocks called pages. Each page is a contiguous range of addresses. These pages are mapped to physical memory, but not all pages must be in memory to run the program. When a program references a portion of the address space in physical memory, the hardware immediately performs the necessary mapping. When a program references a portion of the address space that is not in physical memory, it is the responsibility of the operating system to load the missing portion into physical memory and re-execute the failed instruction.

paging

A technique called paging is used in most systems that use virtual memory. On any computer, a program references a set of memory addresses. When the program executes

MOV REG,1000

This instruction copies the contents of the memory location 1000 into the REG (or vice versa, depending on the computer). Addresses can be generated by index, base address register, segment register, or other means.

The addresses generated by the program are called virtual addresses and form the virtual address space. On a computer without virtual memory, the system puts the virtual address directly onto the memory bus, and the read or write operates on the word of physical memory with that address. When virtual memory is used, virtual addresses are not sent directly to the memory bus; instead, they are mapped to physical memory addresses by the Memory Management Unit (MMU), as shown in the figure below

The picture below shows how this mapping works.

The page table maps virtual addresses to physical memory addresses. Each page starts at a multiple of 4096 and ends at 4095, so 4K to 8K is actually 4096-8191, and 8K-12K is 8192-12287

In this example, we might have a computer that generates 16-bit addresses, ranging from 0 up to 64 KB - 1. These are the virtual addresses. However, the machine has only 32 KB of physical memory. So although 64 KB programs can be written, they cannot be loaded into memory in their entirety and run. A complete copy of the program's core image, up to 64 KB, must be kept on disk so that pieces of it can be brought into memory as needed.

A page table

The virtual page number can be used as the index of the page table to find the contents of the virtual page. You can find the page frame number (if any) from the page table entry. The frame number is then concatenated to the high end of the offset to replace the virtual page number and form the physical address.

Therefore, the purpose of the page table is to map virtual pages to page frames. Mathematically, a page table is a function that takes a virtual page number and results in a physical page frame number.

This function converts the virtual pages in the virtual address into a page frame to form a physical address.
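A sketch of that function for 4 KB pages, with a tiny hand-filled page table whose mappings are purely illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u                 /* 4 KB pages: a 12-bit offset */

/* One entry per virtual page: a present/absent bit plus a page frame number. */
struct pte { int present; uint32_t frame; };

/* Map a virtual address to a physical address; return -1 on a page fault. */
static int translate(uint32_t vaddr, const struct pte *table, uint32_t *paddr) {
    uint32_t page   = vaddr / PAGE_SIZE;    /* virtual page number (high bits) */
    uint32_t offset = vaddr % PAGE_SIZE;    /* offset within the page (low bits) */
    if (!table[page].present)
        return -1;                          /* page not in memory: page fault */
    *paddr = table[page].frame * PAGE_SIZE + offset;
    return 0;
}

int main(void) {
    /* Virtual pages 0, 1, 2 map to frames 2, 5, and (absent). */
    struct pte table[3] = {{1, 2}, {1, 5}, {0, 0}};
    uint32_t paddr;
    if (translate(0x1004, table, &paddr) == 0)      /* page 1, offset 4 */
        printf("virtual 0x1004 -> physical 0x%x\n", paddr);
    if (translate(0x2010, table, &paddr) != 0)
        printf("virtual 0x2010 -> page fault\n");
    return 0;
}
```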

Structure of the page table entry

Now let us look at the structure of a page table entry in more detail. So far we know only its rough outline: a page frame number and a present/absent bit.

The exact layout of a page table entry is machine-specific, but roughly the same information is present on all machines. The figure above shows a typical page table entry. The size varies from computer to computer, but 32 bits is a common size. The most important field is the page frame number; after all, the goal of the page mapping is to produce this value. Next to it is the present/absent bit. If this bit is 1, the entry is valid and can be used; if it is 0, the virtual page to which the entry belongs is not currently in memory, and accessing it causes a page fault.

The protection bits tell what kinds of access are permitted. In the simplest form, this field contains one bit, with 0 meaning read/write and 1 meaning read-only.

The Modified and Referenced bits keep track of page usage. When a page is written to, the hardware automatically sets the Modified bit. This bit matters when a page frame is reclaimed: if the page has been modified (that is, it is dirty), it must be written back to disk; if it has not been modified (it is clean), the page frame can simply be reused, since the copy on disk is still valid. The bit is sometimes called the dirty bit, because it reflects the page's state.

The Referenced bit is set whenever a page is accessed, whether for reading or for writing. Its value helps the operating system choose a page to evict when a page fault occurs: pages that are no longer in use are better candidates for eviction than pages that are in use. This bit plays an important role in several of the page replacement algorithms discussed later.

The last bit allows caching to be disabled for the page, which is important for pages that map onto device registers rather than ordinary memory. Machines that have a separate I/O space and do not use memory-mapped I/O do not need this bit.

Page replacement algorithm

Let’s take a look at what page replacement algorithms are available.

Optimal page replacement algorithm

The optimal page replacement algorithm works as follows: at the moment a page fault occurs, some set of pages is in memory, and one of these pages will be referenced by the very next instruction (the page holding that instruction); other pages may not be referenced until 10, 100, or 1,000 instructions later. Each page can be labeled with the number of instructions that will be executed before that page is first referenced.

The optimal algorithm says that the page with the largest label should be evicted. If one page will not be used for 8 million instructions and another will not be used for 6 million instructions, evicting the former pushes the page fault that will bring it back as far into the future as possible. Computers, like people, put off unpleasant events for as long as they can.

The biggest problem with this algorithm is that it can’t be implemented. When a page-missing interrupt occurs, the operating system has no way of knowing when each page will be accessed next. This algorithm is not used in practice at all.

The Not Recently Used page replacement algorithm

To enable the operating system to collect information about page usage, most computers that use virtual addresses have two status bits, R and M, associated with each page. R is set every time a page is referenced (read or written) and M is set every time a page is written (that is, modified). These bits are contained in each page table entry, as shown below

Because these bits are updated with each access, it is important that the hardware set them. Once a bit is set to 1, it remains 1 until the next time the operating system changes the bit.

If the hardware does not have these bits, you can simulate them using the operating system’s page-miss interrupt and clock interrupt mechanisms. When starting a process, mark all of its pages as out of memory. Any access to any page raises a page miss interrupt, at which point the operating system can set the R bit (in its internal table), modify the page entry to point to the correct page, set it to READ ONLY mode, and then restart the instruction that caused the page miss interrupt. If the page is subsequently modified, another missing page exception occurs. This allows the operating system to set M bits and set the page’s mode to READ/WRITE.

A simple page replacement algorithm can be built from the R and M bits: when a process is started, the operating system sets both bits of all its pages to 0. The R bit is cleared periodically (on every clock interrupt) to distinguish pages that have not been referenced recently from those that have.

When a page miss interrupt occurs, the operating system examines all pages and classifies the current value into four categories based on their R and M bits:

  • Class 0: not referenced (R = 0), not modified (M = 0)
  • Class 1: not referenced (R = 0), modified (M = 1)
  • Class 2: referenced (R = 1), not modified (M = 0)
  • Class 3: referenced (R = 1), modified (M = 1)

Although class 1 pages may at first seem impossible, they arise when a clock interrupt clears the R bit of a class 3 page. Clock interrupts do not clear the M bit, because that information is needed to know whether the page has to be written back to disk. Clearing R but not M is what produces class 1 pages.

The NRU (Not Recently Used) algorithm removes a page at random from the lowest-numbered non-empty class. Implicit in this algorithm is the idea that it is better to evict a modified page that has not been referenced in the last clock tick (about 20 ms) than a clean page that is in heavy use. The main attraction of NRU is that it is easy to understand and can be implemented efficiently.

FIFO page replacement algorithm

Another low-overhead approach is the FIFO (first-in, first-out) algorithm, the same idea as the data structure of that name applied to page replacement. The operating system maintains a linked list of all pages currently in memory, with the earliest arrival at the head and the most recent arrival at the tail. When a page fault occurs, the page at the head is evicted and the new page is appended to the tail of the list.

Second chance page replacement algorithm

The FIFO list just described has a defect: it never examines a page before evicting it, so frequently used pages are easily thrown out. To avoid this, a simple modification is made to the algorithm: inspect the R bit of the oldest page. If it is 0, the page is both old and unused, so it is replaced immediately. If the R bit is 1, the bit is cleared and the page is moved to the end of the list, with its load time updated as though it had just arrived in memory; the search then continues.

This algorithm is called the second chance algorithm. As shown below, pages A through H are kept in a linked list, sorted by the time they arrived in memory.

(a) Pages arranged in first-in, first-out order; (b) the page list when a page fault occurs at time 20 and page A's R bit has already been set.

Suppose the page missing exception occurs at time 20, when the oldest page is A, which arrives at time 0. If the R bit of A is 0, it will be flushed out of memory, either written back to disk (if it has been modified), or simply discarded (if it has not been modified). On the other hand, if its R bit is already set, put A at the end of the list and reset the load time to the current moment (20), then clear the R bit. Then continue to search for suitable pages starting from page B.

What second chance looks for is a page that has not been referenced in the most recent clock interval. If every page has been referenced, the algorithm degenerates into pure FIFO. In particular, suppose all the pages in figure (a) have their R bits set. The operating system moves the pages to the end of the list one by one, clearing the R bit each time a page is appended. Eventually it comes back around to page A, whose R bit has now been cleared, so page A is evicted and the algorithm terminates normally.

Clock page replacement algorithm

A good way to do this is to keep all the pages in a circular list, like a clock face, with one hand pointing to the oldest page. As shown in the figure below

When a page fault occurs, the algorithm first checks the page the hand points to. If its R bit is 0, the page is evicted, the new page is inserted in its place, and the hand advances one position. If the R bit is 1, it is cleared and the hand advances one position. This is repeated until a page with R = 0 is found. Knowing how it operates, you can see why it is called the clock algorithm.
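The following is a small sketch of the clock algorithm in C over a fixed ring of page frames; the frame array, the hand variable, and the frame count are illustrative names rather than anything from a real kernel:

/* Clock page replacement over a ring of NFRAMES page frames. */
#define NFRAMES 8

struct frame {
    int page;                   /* page currently held in this frame */
    int r;                      /* referenced bit                    */
};

static struct frame frames[NFRAMES];
static int hand = 0;            /* points at the oldest page */

/* On a page fault, pick a frame for new_page and return its index. */
int clock_replace(int new_page)
{
    for (;;) {
        if (frames[hand].r == 0) {          /* old and unreferenced: evict */
            int victim = hand;
            frames[victim].page = new_page; /* load the new page here      */
            frames[victim].r = 1;
            hand = (hand + 1) % NFRAMES;    /* advance the hand            */
            return victim;
        }
        frames[hand].r = 0;                 /* give it a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}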

LRU (Least Recently Used) page replacement algorithm

Pages that have been heavily used by the last few instructions will probably be heavily used again by the next few. Conversely, pages that have not been used for a long time will probably remain unused for a while. This observation suggests a realizable algorithm: when a page fault occurs, replace the page that has gone unused for the longest time. This policy is called LRU (Least Recently Used) page replacement.

Although LRU is realizable in theory, it is expensive. To implement full LRU, a linked list of all pages must be kept in memory, with the most recently used page at the head and the least recently used page at the tail. The hard part is that the list must be updated on every memory reference. Finding a page in the list, deleting it, and moving it to the head is a time-consuming operation, even with hardware support.

Simulate LRU with software

Although the LRU algorithm above is feasible in principle, few machines have the special hardware it needs, so we now consider implementing LRU in software. One possibility is the NFU (Not Frequently Used) algorithm. It associates a software counter with each page, initialized to 0. At every clock interrupt the operating system scans all the pages in memory and adds each page's R bit (0 or 1) to its counter. The counters roughly track how often each page has been referenced. When a page fault occurs, the page with the smallest counter value is replaced.

To make the NFU emulate the LRU, you only need to make a simple modification, which has two steps

  • First, each counter is shifted right one bit before the R bit is added;
  • Second, the R bit is added to the leftmost bit rather than the rightmost bit.

The modified algorithm is called the aging algorithm, and the figure below illustrates how the aging algorithm works.

Suppose that in the first clock tick the R bits of pages 0-5 are 1, 0, 1, 0, 1, 1 (page 0 is 1, page 1 is 0, page 2 is 1, and so on). In other words, between tick 0 and tick 1, pages 0, 2, 4, and 5 were referenced, setting their R bits to 1 while the others stayed at 0. After the six counters have been shifted right, the R bits are added at the left, as in (a) in the figure above. The remaining four columns show the six counters after the next four clock ticks.

The CPU advances at a rate set by its clock; the interval between pulses is a clock cycle. A 100 MHz processor goes through 100,000,000 clock cycles per second. (The clock ticks used by the aging algorithm are the much less frequent periodic timer interrupts.)

When a page fault occurs, the page with the smallest counter value is replaced (evicted). If a page has not been referenced for the previous four clock ticks, its counter begins with four zeros, so its value is necessarily smaller than that of a page that has gone unreferenced for only three clock ticks.

There are two important differences between this algorithm and true LRU (look at (e), the third and fifth columns, in the figure above): aging cannot tell which of two pages was referenced earlier within a single clock tick, and because the counters have a finite number of bits (eight here), two pages whose counters have decayed to zero can no longer be distinguished, even if one was referenced far more recently than the other.
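For reference, here is a compact sketch of the aging mechanism itself in C, assuming hypothetical per-page counter and R-bit arrays and the 8-bit counters used in the figure:

#include <stdint.h>

/* Aging with 8-bit software counters; arrays and NPAGES are illustrative. */
#define NPAGES 6

static uint8_t counter[NPAGES];     /* one aging counter per page      */
static uint8_t r_bit[NPAGES];       /* R bits gathered at each tick    */

/* Called on every clock interrupt: shift right, then put R into bit 7. */
void age_pages(void)
{
    for (int i = 0; i < NPAGES; i++) {
        counter[i] = (uint8_t)((counter[i] >> 1) | (r_bit[i] << 7));
        r_bit[i] = 0;               /* clear R for the next interval */
    }
}

/* Called on a page fault: evict the page with the smallest counter. */
int aging_select_victim(void)
{
    int victim = 0;
    for (int i = 1; i < NPAGES; i++)
        if (counter[i] < counter[victim])
            victim = i;
    return victim;
}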

Working set clock page replacement algorithm

When a page fault occurs, the entire page table must be scanned to find a page to evict, so the basic working set algorithm is expensive. An improved algorithm, based on the clock algorithm but also using working set information, is called WSClock (Working Set Clock). Because it is simple to implement and performs well, it is widely used in practice.

As with the clock algorithm, the data structure needed is a circular list with page frames as its elements, as shown below.

Operation of the WSClock algorithm: (a) and (b) show what happens when R = 1; (c) and (d) show what happens when R = 0.

Initially the list is empty. When the first page is loaded, it is added to the list. As more pages are added, they form a ring. Each entry contains the time of last use from the basic working set algorithm, as well as the R bit (shown) and the M bit (not shown).

As with the clock algorithm, on each page fault the page pointed to by the hand is examined first. If its R bit is 1, the page has been used during the current tick and is not a good candidate for eviction. Its R bit is set to 0, the hand advances to the next page, and the algorithm is repeated. The state after this sequence of events is shown in figure (b).

Now consider what happens when the page pointed to has R = 0; see figure (c). If its age is greater than τ and the page is clean, it is not in the working set and a valid copy exists on disk, so the page frame is simply claimed and the new page is put there, as shown in figure (d). On the other hand, if the page has been modified, its frame cannot be claimed immediately, since no valid copy exists on disk. To avoid a process switch, the write back to disk is scheduled, the hand advances, and the algorithm continues with the next page. After all, there may be an old, clean page further along that can be used immediately.

In principle, all pages might be scheduled for disk I/O in one sweep around the clock. To reduce disk traffic, a limit of n write-backs is set; once that limit is reached, no new writes are scheduled.

So what happens if the hand goes all the way around and comes back to its starting point? There are two cases:

  • At least one write operation has been scheduled
  • No write operations have been scheduled

In the first case, the hand simply keeps moving, looking for a clean page. Since one or more writes have been scheduled, eventually some write will complete and its page will be marked as clean; the first clean page encountered is evicted. This page is not necessarily the first write that was scheduled, because the disk driver may reorder writes to optimize performance.

In the second case, all pages are in the working set; otherwise at least one write would have been scheduled. Lacking additional information, the simplest thing to do is to claim any clean page and use it, so the location of a clean page must be remembered during the sweep. If no clean page exists, the current page is chosen and written back to disk.
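Putting the pieces together, here is a simplified WSClock sketch; the frame layout, the τ value, the bound on how far the hand sweeps, and the schedule_write() helper are all assumptions made for illustration (a real implementation would also cap the number of writes scheduled per sweep, as described above):

/* Simplified WSClock over a ring of page frames. */
#define NFRAMES 16
#define TAU     50              /* working set window, in virtual time units */

struct frame {
    int r, m;                   /* referenced / modified bits */
    long last_use;              /* virtual time of last use   */
    int page;
};

static struct frame ring[NFRAMES];
static int hand = 0;

static void schedule_write(int f) { (void)f; /* ask the disk driver to clean frame f */ }

int wsclock_replace(long now, int new_page)
{
    int first_clean = -1;

    for (int scanned = 0; scanned < 2 * NFRAMES; scanned++) {
        struct frame *f = &ring[hand];

        if (f->r) {                         /* used this tick: keep it     */
            f->r = 0;
            f->last_use = now;
        } else if (now - f->last_use > TAU) {
            if (!f->m) {                    /* old and clean: claim it     */
                f->page = new_page;
                f->last_use = now;
                f->r = 1;
                return hand;
            }
            schedule_write(hand);           /* old but dirty: clean it     */
        } else if (!f->m && first_clean < 0) {
            first_clean = hand;             /* remember a clean page       */
        }
        hand = (hand + 1) % NFRAMES;
    }

    /* Came all the way around without claiming a frame: fall back to any
     * clean page seen along the way, otherwise just take the current one. */
    int victim = (first_clean >= 0) ? first_clean : hand;
    ring[victim].page = new_page;
    ring[victim].last_use = now;
    ring[victim].r = 1;
    ring[victim].m = 0;
    return victim;
}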

Page replacement algorithm summary

We have now looked at a variety of page replacement algorithms. This section gives a brief summary; the algorithms are listed below.

  • The optimal algorithm: not implementable, but useful as a benchmark
  • NRU (Not Recently Used): a crude approximation of LRU
  • FIFO (First-In, First-Out): may discard important pages
  • Second chance: a big improvement over FIFO
  • Clock: realistic and used in practice
  • LRU (Least Recently Used): excellent, but hard to implement exactly
  • NFU (Not Frequently Used): a fairly crude approximation of LRU
  • Aging: an efficient algorithm that approximates LRU well
  • Working set: somewhat expensive to implement
  • WSClock (working set clock): good, efficient algorithm
  • The optimal algorithm evicts the page that will be referenced furthest in the future. Unfortunately, there is no way to know which page that is, so in practice this algorithm cannot be used. It is, however, useful as a yardstick against which other algorithms can be measured.

  • The NRU algorithm divides pages into four classes according to the state of their R and M bits, and a random page from the lowest-numbered non-empty class is chosen. NRU is easy to implement, but its performance is not very good; better algorithms exist.

  • FIFO keeps track of the order in which pages were loaded into memory by keeping them in a linked list. It may evict the oldest page even though that page is still in use, so FIFO is not a good choice either.

  • The second chance algorithm is a modification of FIFO that checks whether a page is still in use before evicting it; if it is, the page is kept. This modification greatly improves performance.

  • The clock algorithm is a different implementation of second chance. It has the same performance properties but takes a little less time to run.

  • LRU is an excellent algorithm, but it cannot be implemented exactly without special hardware support. Without that hardware, exact LRU is out of reach.

  • NFU is a crude attempt to approximate LRU; its performance is not very good.

  • The aging algorithm is a closer implementation of the LRU algorithm and can be better implemented, so it is a good choice

  • The last two algorithms are based on the working set. The working set algorithm gives reasonable performance, but it is somewhat expensive to implement. WSClock is a variant that not only gives good performance but is also efficient to implement.

In short, the best algorithms are the aging algorithm and WSClock algorithm. They are based on LRU and working set algorithms respectively. They all have good performance and can be implemented effectively. There are other good algorithms out there, but actually these two are probably the most important.

Here’s what you need to know about file systems

file

The file name

A file is an abstraction mechanism: it provides a way to store information and read it back later. Perhaps the most important characteristic of any abstraction mechanism is the way the objects being managed are named. When a process creates a file, it gives the file a name. When the process terminates, the file continues to exist, and other processes can access it by that name.

File naming rules vary from operating system to operating system, but all modern operating systems at least allow strings of one to eight letters as legal file names.

Some file systems distinguish between upper and lower case letters; others do not. UNIX falls into the first category, the venerable MS-DOS into the second (incidentally, despite its age MS-DOS is still widely used in embedded systems, so it is by no means obsolete). A UNIX system can therefore have three different files named maria, Maria, and MARIA; in MS-DOS, all of these names refer to the same file.

Many operating systems support two-part file names separated by a period, as in prog.c. The part after the period is called the file extension and usually indicates something about the file. Some common extensions and their meanings are shown below.

  • .bak: backup file
  • .c: C source program
  • .gif: image in the Graphics Interchange Format
  • .hlp: help file
  • .html: World Wide Web HyperText Markup Language document
  • .jpg: still picture encoded with the JPEG standard
  • .mp3: music encoded in the MP3 audio format
  • .mpg: movie encoded with the MPEG standard
  • .o: object file (compiler output, not yet linked)
  • .pdf: file in Portable Document Format
  • .ps: PostScript file
  • .tex: input file for the TEX formatting program
  • .txt: general text file
  • .zip: compressed archive

On UNIX systems, file extensions are conventions that are not enforced by operating systems.

File structure

Files can be constructed in a number of ways. The following figure lists three commonly used constructs

Three kinds of files: (a) an unstructured byte sequence, (b) a sequence of records, (c) a tree.

In (a), the file is an unstructured sequence of bytes. The operating system does not know or care what is in the file; all it sees are bytes. Any meaning must be imposed by user-level programs. Both UNIX and Windows take this approach.

In (b), the first step up in structure, a file is a sequence of fixed-length records, each with its own internal structure. The central idea here is that a read returns one record and a write overwrites or appends one record. The third structure is shown in (c): a file consists of a tree of records, not necessarily all the same length, each containing a key field in a fixed position. The tree is sorted on the key, allowing rapid lookup of a particular key.

The file type

Many operating systems support several types of files. UNIX (including OS X) and Windows, for example, have regular files and directories. UNIX also has character special files and block special files. Regular files are the ones that contain user information; most files used day to day are regular files, including executables, text files, and images. When data are read from or written to a regular file, the kernel carries out the operation according to the rules of the file system; a write may be delayed (buffered), journaled, or otherwise processed before it reaches the disk.

File access

Early operating systems provided only one kind of file access: sequential access. A process could read all the bytes or records of a file in order, starting at the beginning, but could not skip around and read them out of order. Sequential files could be rewound to the start, so they could be read as many times as needed. Sequential access was convenient when the storage medium was magnetic tape rather than disk.

When a file is stored on disk, bytes or records in the file can be read out of order, or records can be accessed by keyword instead of location. Such a file that can be read in any order is called a random access file. Many applications require this approach.

Random file access is essential for many applications, such as database systems. If a passenger calls to book a flight, the booking program must be able to access the flight’s records directly, rather than reading thousands of records for other flights.

There are two ways to specify where to start reading. In the first, every read operation gives the position in the file to start reading at. In the second, a special seek operation sets the current position, and after the seek the file is read sequentially from that position. UNIX and Windows use the latter approach.

File attributes

A file consists of a file name and its data. In addition, all operating systems store other file-related information, such as the date and time the file was created and its size. We can call these the file's attributes; some people call them metadata. The set of attributes varies considerably from system to system. Some attributes are flags with only two states, set and cleared; others, such as the size, are values.

File operations

The purpose of using files is to store information and facilitate later retrieval. Different systems provide different operations for storage and retrieval. Here are some of the most common system calls related to files:

  1. Create: creates a file with no data. The purpose of the call is to announce that the file is coming and to set some of its attributes.
  2. Delete: when a file is no longer needed, it must be deleted to free disk space. There is always a system call for this purpose.
  3. Open: a file must be opened before it can be used. The purpose of the open call is to let the system fetch the file's attributes and list of disk addresses into main memory for rapid access on later calls.
  4. Close: when the accesses are finished, the attributes and disk addresses are no longer needed, so the file should be closed to free internal table space. Many systems limit the number of files a process may have open, to encourage users to close files they no longer need. A disk is written in blocks, and closing a file forces its last block to be written out, even if that block is not yet full.
  5. Read: data are read from the file, usually from the current position. The caller must specify how much data to read and provide a buffer to put them in.
  6. Write: data are written to the file, usually at the current position. If the current position is the end of the file, the write appends to the file. If the current position is in the middle of the file, existing data are overwritten and lost forever.
  7. Append: a restricted form of write that can only add data to the end of the file.
  8. Seek: for random-access files, a way is needed to say where to take the data from. The usual approach is the seek system call, which repositions the current position to a specific place in the file. After the call, data can be read from or written to that position.
  9. Get attributes: processes often need to read a file's attributes to do their work.
  10. Set attributes: some attributes can be set by the user, even after the file has been created. The set attributes system call makes this possible.
  11. Rename: users frequently need to change the name of an existing file; the rename system call is used for this purpose.
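Most of these operations map directly onto POSIX calls; the following small sketch exercises several of them (the file name is made up, and error handling is trimmed to keep the example short):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    char buf[16];
    int fd = open("demo.txt", O_CREAT | O_RDWR | O_TRUNC, 0644); /* create + open */
    if (fd < 0) { perror("open"); return 1; }

    write(fd, "hello, file\n", 12);          /* write at the current position */

    lseek(fd, 0, SEEK_SET);                  /* seek back to the beginning    */
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) { buf[n] = '\0'; printf("read back: %s", buf); }

    close(fd);                               /* free the kernel table entry   */
    rename("demo.txt", "demo-renamed.txt");  /* change the file's name        */
    unlink("demo-renamed.txt");              /* delete it again               */
    return 0;
}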

directory

File systems normally provide directories or folders for keeping track of files; in many systems, directories are themselves files. Here we discuss directories, their organization, their properties, and the operations that can be performed on them.

Single-level directory systems

In its simplest form, a directory system has a directory that contains all files. This type of directory is called the root directory, and because the root directory is unique, its name is not important. Such systems were common in the earliest personal computers, in part because there was only one user. Here is an example of a single-tier directory system

A single-layer directory system with four files

There are four files in this directory. The advantage of this design is simplicity and the ability to quickly locate files, since there is only one place to retrieve them. This type of directory organization is now commonly used for simple embedded devices such as digital cameras and certain portable music players.

Hierarchical directory system

Single-layer directories are common for simple applications, but this organization is not suitable for modern computers, which contain thousands of files and folders. If you put them all in the root directory, it will be very difficult to find them. To solve this problem, Hierarchical Directory Systems, also known as Directory trees, emerged. In this way, files can be grouped with many directories. Further, if multiple users share the same file server, such as a corporate network system, each user can have their own private root directory for their directory tree. The organization structure of this approach is as follows

The root directory contains directories A, B, and C, which belong to different users. Two of the users create subdirectories. Users can create as many subdirectories as they want, and modern file systems are organized this way.

The path name

When a directory tree organizes a file system, you need some way to specify the file name. There are two common methods. The first method is to use an absolute path name for each file, which consists of the path from the root directory to the file.

Another way to specify file names is relative Path name. It is often used with working directories (also known as current directories). The user can specify a directory as the current working directory. For example, if the current directory is /usr/ast, the absolute path /usr/ast/mailbox can be referenced directly using mailbox.

Directory operations

The system calls for managing directories vary more from system to system than the system calls for files. To give an impression of what they are and how they work, here is an example (taken from UNIX).

  1. Create: creates a directory. Apart from the dot and dotdot entries, which the system puts there automatically, the new directory is empty.
  2. Delete: deletes a directory. Only an empty directory can be deleted; a directory containing only dot and dotdot is considered empty, and these two entries cannot normally be deleted themselves.
  3. Opendir: opens a directory so its contents can be read. For example, to list all the files in a directory, a program must first open the directory and then read the names of the files in it. Just as a file must be opened before it can be read, a directory must be opened before it can be read.
  4. Closedir: when a directory has been read, it should be closed to free internal table space.
  5. Readdir: returns the next entry in an open directory. Formerly it was possible to read directories with the ordinary read system call, but that approach forces the programmer to know and deal with the internal structure of directories. In contrast, readdir always returns one entry in a standard format, no matter which directory structure is used.
  6. Rename: directories can be renamed, just as files can.
  7. Link: linking allows the same file to appear in more than one directory. This system call specifies an existing file and a path name, and creates a link from the existing file to the name given by the path. In this way the same file may appear in multiple directories. A link of this kind is sometimes called a hard link.
  8. Unlink: removes a directory entry. If the file being unlinked appears in only one directory, it is removed from the file system altogether. If it appears in several directories, only the link with the specified path name is removed; the links with the other path names remain. In UNIX, the system call for deleting files is in fact unlink.
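For instance, here is a sketch of the opendir/readdir/closedir calls in C, listing the current directory (chosen only as a convenient example):

#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *d = opendir(".");                  /* opendir */
    if (d == NULL) { perror("opendir"); return 1; }

    struct dirent *entry;
    while ((entry = readdir(d)) != NULL)    /* readdir: one entry at a time */
        printf("%s\n", entry->d_name);      /* includes the . and .. entries */

    closedir(d);                            /* closedir */
    return 0;
}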

File system implementation

File system Layout

File systems are stored on disks. Most disks can be divided into one or more partitions (also called disk slices), each with its own, possibly different, file system. Sector 0 of the disk is called the MBR (Master Boot Record) and is used to boot the computer. The end of the MBR contains a partition table, which gives the starting and ending address of each partition.

When the computer starts to boot, the BIOS reads in and executes the MBR.
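As a concrete (and hedged) illustration, the classic PC MBR is a 512-byte sector: boot code, then four 16-byte partition entries starting at byte offset 446 (0x1BE), then the signature bytes 0x55, 0xAA. The structure and helper below are my own sketch of that layout, not code from any particular system, and they ignore endianness issues:

#include <stdint.h>
#include <string.h>

struct mbr_partition_entry {        /* 16 bytes each, four of them at offset 0x1BE */
    uint8_t  status;                /* 0x80 = active (bootable) partition */
    uint8_t  chs_first[3];          /* CHS address of the first sector    */
    uint8_t  type;                  /* partition type code                */
    uint8_t  chs_last[3];           /* CHS address of the last sector     */
    uint32_t lba_first;             /* where the partition starts, in sectors */
    uint32_t sector_count;          /* how many sectors it occupies       */
};

/* Copy partition entry i (0..3) out of a raw 512-byte MBR sector.
 * The sector must end with the boot signature bytes 0x55, 0xAA. */
int mbr_read_entry(const uint8_t sector[512], int i,
                   struct mbr_partition_entry *out)
{
    if (i < 0 || i > 3 || sector[510] != 0x55 || sector[511] != 0xAA)
        return -1;
    memcpy(out, sector + 446 + 16 * i, sizeof(*out));
    return 0;
}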

Boot block

The first thing the MBR does is identify the active partition, read its first block, called a boot block, and execute it. The program in the boot block loads the operating system in the partition. For consistency, each partition starts with a boot block, even if the boot block does not contain the operating system. The boot block occupies the first 4096 bytes of the file system, starting with byte offset 0 on disk. The boot block can be used to start the operating system.

In addition to starting with the boot block, the layout of the disk partitions varies from file system to file system. Usually a file system contains a few attributes, such as the following

File system Layout

superblock

Immediately following the boot block is the Superblock, which is 4096 bytes in size, starting with byte offset 4096 on disk. The super block contains all the key parameters of the file system

  • Size of the file system
  • Number of data blocks in the file system
  • Flag indicating the status of a file system
  • Allocate group size

Superblocks are read into memory at computer startup or when the file system is first used.

Free space block

This is followed by information about free blocks in the file system, for example, in bitmaps or pointer lists.

A BitMap or a Bit vector

A bitmap or bit vector is a series of bits, one per disk block. A bit can take two values: 0 means the block is allocated, 1 means the block is free. The disk shown in the figure below (allocated blocks in green) can thus be represented by the 16-bit bitmap 0000111000000110.
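A minimal sketch of bitmap-based block management in C, using the same convention as above (1 = free, 0 = allocated); the block count and the static bitmap are illustrative, and a real system would initialize the bitmap from the copy kept on disk:

#include <stdint.h>

#define NBLOCKS 1024

static uint8_t bitmap[NBLOCKS / 8];     /* one bit per disk block */

static int  is_free(int b)  { return (bitmap[b / 8] >> (b % 8)) & 1; }
static void set_free(int b) { bitmap[b / 8] |=  (uint8_t)(1 << (b % 8)); }
static void set_used(int b) { bitmap[b / 8] &= (uint8_t)~(1 << (b % 8)); }

/* Allocate the first free block, or return -1 if the disk is full. */
int alloc_block(void)
{
    for (int b = 0; b < NBLOCKS; b++) {
        if (is_free(b)) {
            set_used(b);
            return b;
        }
    }
    return -1;
}

void free_block(int b) { set_free(b); }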

Use linked lists for management

In this approach, free disk blocks are linked together, meaning that one free block contains a pointer to the next free block. The block number of the first disk block is stored in a separate location on the disk and also cached in memory.

Fragments

It is worth mentioning the concept of fragments here. A single piece of data that has been broken into non-contiguous pieces is said to be fragmented. Disk blocks can be further divided into fixed-size allocation units, and fragments are simply pieces of a file that are not adjacent to each other on the drive.

inode

Next comes the inode (index node). It is an array-like structure, and every file has one. Inodes matter because they record everything the file system knows about a file: its attributes and the locations of its disk blocks.

An easy way to see inode numbers is the ls -lai command; for example, run it on the root file system:

The inode contains the following information

  • Mode/Permission (Protection)
  • Owner ID
  • Group ID
  • The file size
  • Number of hard links to the file
  • Last access time
  • Last Modified time
  • Inode last modified time

The file system is divided into two parts: inodes and blocks. Once the file system is created, the number of each is fixed: you can increase neither the number of inodes on a partition nor the number of disk blocks.
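A hedged sketch of what such an on-disk inode might look like, loosely in the style of classic UNIX inodes; the field names and sizes here are illustrative, not those of any particular file system:

#include <stdint.h>

#define NDIRECT 12

struct inode {
    uint16_t mode;              /* file type and permission bits        */
    uint16_t uid;               /* owner ID                             */
    uint16_t gid;               /* group ID                             */
    uint16_t nlink;             /* number of hard links to the file     */
    uint32_t size;              /* file size in bytes                   */
    uint32_t atime;             /* last access time                     */
    uint32_t mtime;             /* last modification time               */
    uint32_t ctime;             /* last inode change time               */
    uint32_t direct[NDIRECT];   /* disk addresses of the first blocks   */
    uint32_t indirect;          /* block holding further block addresses */
    uint32_t double_indirect;   /* block holding indirect block addresses */
};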

Immediately following the inode is the root directory, which holds the root of the file system directory tree. Finally, the rest of the disk houses all the other directories and files.

File implementation

The most important issue is keeping track of which disk blocks go with which file. Different systems take different approaches, which we explore below. The goals are efficient use of disk space and fast access to files. There are three main allocation schemes:

  • Contiguous allocation
  • Linked list allocation
  • Indexed allocation

Contiguous allocation

The simplest allocation scheme is to store each file on disk as a series of contiguous data blocks. Therefore, on a disk with 1KB blocks, 50 contiguous blocks will be allocated for 50 KB files.

Use contiguous space to store files

The figure shows 40 contiguous disk blocks, starting with block 0 on the far left. Initially no files have been loaded, so the disk is empty. Then file A, four blocks long, is written starting at the beginning of the disk (block 0). After that, file B, six blocks long, starts right at the end of A.

Note that each file begins at the start of a fresh block, so if file A really occupied only 3½ blocks, some space at the end of its last block would be wasted. In the figure there are seven files in total, each one starting at the block following the end of the previous one.

Continuous disk space allocation has two advantages.

  • First, continuous file storage is relatively easy to implement. You only need to remember two numbers: the file address of the first block and the number of blocks of the file. Given the number of the first block, you can find the number of any other block by simple addition.

  • Second, read performance is excellent, because an entire file can be read in a single operation. Only one seek is needed (for the first block); after that, no further seeks or rotational delays are required, so data come off the disk at the full bandwidth of the disk.

Therefore, continuous space allocation has the characteristics of simple implementation and high performance.

Unfortunately, contiguous allocation also has a significant drawback: over time, the disk becomes fragmented. The figure below illustrates the phenomenon.

Two files, D and F, have been deleted. When a file is deleted, its blocks are freed, leaving a run of free blocks on the disk. The disk is not compacted on the spot to squeeze out the hole, since that would mean copying all the blocks following the hole, potentially millions of them, which is far too expensive.

Linked list allocation

The second way to store files is to construct linked lists of disk blocks for each file, with each file being a linked list of disk blocks, as shown below

Store files as linked lists of disk blocks

The first word of each block is used as a pointer to the next block; the rest of the block holds data. If the figure above is hard to read, the next figure shows the complete linked-list allocation scheme.

Unlike the continuous allocation scheme, this approach makes full use of each disk block. With the exception of the last disk block, no storage space is wasted due to disk fragmentation. Similarly, in a directory entry, as long as the first file block is stored, other file blocks can also be found.

On the other hand, in a linked list allocation scheme, while sequential reads are convenient, random access is difficult (a big difference between arrays and linked list data structures).

Another problem is that, because the pointer takes up a few bytes, the amount of data stored in each block is no longer a power of two. This is not fatal, but it does reduce efficiency: since the first few bytes of each block are used by the pointer, reading a full block's worth of data requires piecing together information from two blocks, which adds the overhead of extra lookups and concatenation.

Use memory tables for linked list allocation

Both contiguous allocation and linked list allocation have shortcomings that cannot be ignored. The solution is to take the pointer word out of each disk block and put it in a table in memory, which removes both disadvantages of the linked list approach. Here is an example.

The figure above shows the in-memory table for the linked lists of disk blocks of two files. File A uses disk blocks 4, 7, 2, 10, and 12, in that order; file B uses 6, 3, 11, and 14. That is, file A starts at block 4, and the chain can be followed to find all of its blocks; similarly, starting at block 6 and following the chain to the end yields all the blocks of file B. Both chains end with a special marker (-1) that is not a valid block number. A table of this kind held in memory is called a FAT (File Allocation Table).
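A tiny sketch of following such a chain in C, using the same example chains as above; the table size and the value used to mark free entries are illustrative:

#include <stdio.h>

#define FAT_SIZE 16
#define FAT_EOF  (-1)           /* marks the end of a chain */

/* Index:                 0  1   2   3  4  5  6  7  8  9  10  11       12  13       14  15 */
static int fat[FAT_SIZE] = {0, 0, 10, 11, 7, 0, 3, 2, 0, 0, 12, 14, FAT_EOF, 0, FAT_EOF, 0};
/* 0 means "free entry" in this sketch; real FATs use dedicated values. */

/* Print every block of the file that starts at first_block. */
void print_chain(int first_block)
{
    for (int b = first_block; b != FAT_EOF; b = fat[b])
        printf("%d ", b);
    printf("\n");
}

int main(void)
{
    print_chain(4);     /* file A: 4 7 2 10 12 */
    print_chain(6);     /* file B: 6 3 11 14   */
    return 0;
}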

Directory implementation

A file can only be read after it has been opened. When a file is opened, the operating system uses the path name supplied by the user to locate the directory entry on disk. The directory entry provides the information needed to find the file's disk blocks. Depending on the system, this information may be the disk address of the entire file (contiguous allocation), the number of the first block (both linked-list schemes), or the number of the inode. In every case, the main function of the directory system is to map the ASCII name of a file onto the information needed to locate its data.

The Shared file

When multiple users are working on the same project, they often need to share files. If the shared file appears in multiple user directories at the same time, they can work together easily. The following image was mentioned above, but one change is that a file from C also appears in B’s directory.

When organized as shown in the figure above, the connection between B's directory and the shared file is called a link. The file system itself is now a Directed Acyclic Graph (DAG) rather than a tree.

Log structured file system

Changes in technology put pressure on existing file systems. CPUs keep getting faster, disks keep getting bigger and cheaper (but not much faster), and memory sizes grow exponentially. The one parameter that is not improving by leaps and bounds is disk seek time (solid-state drives excepted, since they have no seek time).

Berkeley designed a new type of File System to alleviate this problem: the Log-structured File System (LFS). It aims to solve the following problems.

  • Growing system memory

  • Sequential I/O performance beats random I/O performance

  • An existing inefficient file system

  • File Systems do not support RAID (Virtualization)

On the other hand, file systems of the time, both UNIX and FFS, had a large number of random reads and writes (at least five random writes to create a new file in FFS), and thus became a performance bottleneck for the entire system. At the same time, because of the Page cache, the author thinks that random reads are not the main problem: as more and more memory is stored, most reads can be cached, so LFS’s main solution is to reduce random writes to hard disk.

In this design, inodes have the same structure as in UNIX, but they are now scattered throughout the log instead of sitting at a fixed place on disk, so they must be located somehow. To find them, an inode map indexed by i-number is maintained: entry i gives the location of inode i on disk. The map is kept on disk, but it is also cached, so the most heavily used parts are in memory most of the time.

So far, all writes are initially cached in memory and appended to the end of the log, and all cached writes are periodically written to disk in a single segment. So, opening the file now means locating the file’s index node with a map. Once the inode is located, the address of the disk block can be found. All of these blocks themselves will be in segments somewhere in the log.

In the real world, disk capacity is limited, so eventually the log will fill the entire disk space, in which case no new disk blocks will be written to the log. Fortunately, many existing segments may have blocks that are no longer needed. For example, if a file is overwritten, its inode will be pointed to the new block, but the old disk block will still occupy space in the previously written segment.

To deal with this problem, LFS has a clean thread that cycles through the logs and compresses them. First, look at the information in the first part of the log to see which index nodes and files exist. It checks the mapping of the current inode to see if the inode is still in use in the current block. If not, the information is discarded. If it is still in use, inodes and blocks go into memory to wait to be written back to the next segment. The original segment is then marked as free so that the log can be used to store new data. In this way, the cleanup thread traverses the log, removes the old segment from behind, and then puts the valid data into memory to wait for writing to the next segment. As a result, the entire disk forms a large circular buffer, with the writer thread writing the new segment first and the cleaning thread cleaning the following segment.

Journaling file systems

Although log-structured file systems are elegant, they are not widely used, in part because they are incompatible with existing file systems. Nevertheless, one idea borrowed from them, keeping a log of what the system is about to do before doing it, has evolved into journaling file systems. Microsoft's NTFS and Linux's ext3 use such a journal, and OS X offers journaling as an option. To see how this works, consider a routine operation such as removing a file, which in UNIX requires three steps:

  • Delete files in the directory
  • Free inodes to the free inode pool
  • Return all disk blocks to the free disk pool.

Virtual file system

The UNIX operating System uses a Virtual File System (VFS) to try to organize multiple File systems into an ordered structure. The key idea is to abstract out what is common to all file systems and put that code in a layer that then calls the specific file system to manage the data. Here is the architecture of a VFS system

Once again, as so often in computing, the problem is solved by adding a layer of indirection. All file-related system calls are initially directed to the virtual file system. These calls, coming from user processes, are the standard POSIX calls such as open, read, write, and lseek. Thus the VFS's upper interface, the one presented to user processes, is the POSIX interface.

File system management and optimization

It’s one thing to be able to make a file system work, it’s another thing to be able to make a file system work efficiently and stably, so let’s look at file system management and optimization.

Disk Space Management

Files are normally kept on disk, so managing disk space is a major concern for file system designers. There are two general strategies for storing an n-byte file: allocate n consecutive bytes of disk space, or split the file into a number of blocks that need not be contiguous. The same tradeoff appears in memory management between pure segmentation and paging.

As we have seen, storing a file as a contiguous byte sequence has the obvious problem that the file may have to be moved on disk as it grows. Segments in memory have the same problem, except that moving a segment in memory is a relatively fast operation, whereas moving a file from one disk location to another is not. For this reason, nearly all file systems chop files up into fixed-size blocks.

The block size

Once files are stored in fixed-size chunks, the question arises: What is the size of the chunk? Depending on the disk organization, sectors, tracks, and cylinders can obviously be used as allocation units. Page size is also a major factor in paging systems.

Having a large allocation unit, such as a cylinder, means that every file, even a 1-byte file, ties up an entire unit, so small files waste a large amount of disk space. A small allocation unit, on the other hand, means that most files span many blocks, and reading them requires many seeks and rotational delays, which hurts performance. So if the allocation unit is too large, space is wasted; if it is too small, time is wasted.

Record free block

Once the block size is specified, the next problem is how to keep track of free blocks. Two approaches are widely used, as shown in the figure below

The first approach is to use a linked list of disk blocks, each of which contains as many free disk block numbers as possible. For 1 KB blocks and 32-bit disk blocks, each block in the free table contains 255 free block numbers. Consider a terabyte hard drive, which has about a billion disk blocks. To store all address block numbers, if each block could hold 255 blocks, you would need nearly 4 million blocks. Typically, free blocks are used to hold free lists, so the storage is essentially free.

The other widely used technique for free-space management is the bitmap: a disk with n blocks requires n bits. In the bitmap, free blocks are represented by 1s and allocated blocks by 0s. For the 1-terabyte disk example, a billion bits are needed, which takes about 130,000 1-KB blocks to store. Clearly the bitmap requires less space than the 32-bit linked-list scheme, since it uses one bit per block rather than 32. Only when the disk is nearly full does the linked-list scheme require fewer blocks than the bitmap.

Disk quota

To prevent users from hogging too much disk space, multiuser operating systems often provide a mechanism for enforcing disk quotas. The system administrator assigns each user a maximum number of files and blocks, and the operating system makes sure that users do not exceed their quotas. The mechanism works as follows.

When the user opens a file, the operating system finds the file properties and disk address and feeds them into the open file table in memory. One of the attributes tells you who owns the file. Any additions to related files are credited to the owner’s quota.

The quota table records the quota of each user

The second table contains a quota record for each user’s currently open file, even if it is opened by someone else. As shown in the figure above, the contents of this table are extracted from the disk quota file of the owner of the opened file. When all files are closed, the record is written back to the quota file.

When a new entry is made in the open file table, a pointer to the owner's quota record is included. Every time a block is added to a file, the total number of blocks charged to the file's owner is incremented, and a check is made against both the hard and the soft limit. The soft limit may be exceeded; the hard limit may not. An attempt to append to a file when the hard block limit has been reached results in an error. Analogous checks exist for the number of files.

File system Backup

Making backups takes a long time and occupies a large amount of space, so doing it efficiently and conveniently matters. This raises several issues. First, should the entire file system be backed up or only part of it? In practice, usually only a specific directory and everything under it is backed up, not the whole file system.

Second, it is wasteful to back up files that have not changed since the previous backup, which leads to the idea of incremental dumps. The simplest form of incremental dumping is to make a full dump periodically and, in between, to dump only the files that have changed since the last full dump.

A slightly better approach is to back up only the files that have changed since the last dump. Of course, this greatly reduces dump time, but recovery is more complex, since the most recent full dump is first restored in full, followed by incremental dumps in reverse order. To facilitate recovery, people tend to use more complex dump patterns.

Third, since huge amounts of data are typically dumped, it is tempting to compress the data before writing them to tape. However, with many compression algorithms, a single damaged spot on the backup medium can defeat the decompression algorithm and make an entire file, or even the entire tape, unreadable. Thus the decision to compress the backup stream must be considered carefully.

Fourth, it is hard to back up a file system that is in active use. If files and directories are added, removed, or modified during the dump, the result may be inconsistent. Because a dump can take hours, one option is to take the system offline overnight for backup, but that is often unacceptable. Dump algorithms have therefore been devised that take a quick snapshot of the file system state, copying the critical data structures, and then require future changes to files and directories to copy their blocks instead of updating them in place.

Two methods can be used for dumping a disk to a backup disk: a physical dump or a logical dump. A physical dump starts at block 0 of the disk, writes all the disk blocks to the output disk in order, and stops after copying the last one. Such a program is so simple that it can be made almost completely error free, which cannot be said of any other practically useful program.

The second consideration is bad block dumps. It is impossible to make large disks without defects, so there will be some bad blocks. Sometimes bad blocks are detected and flagged after low-level formatting, and the solution is to replace them with some free blocks at the end of the disk.

However, some blocks go bad after formatting, in which case the operating system detects them. It usually solves the problem by creating a file made up of all the bad blocks, ensuring that they never appear in the free pool and are never allocated. Needless to say, this file is completely unreadable. If the disk controller remaps all the bad blocks transparently, a physical dump still works.

Windows systems have paging files and Hibernation Files. They do not play a role in restoring files and should not be backed up in the first place.

File system consistency

One factor that affects reliability is file system consistency. Many file systems read disk blocks, modify disk blocks, and write them back to disk. If the system crashes before all blocks are written, the file system is put in an inconsistent state. This is a serious problem if some of the blocks that have not been written back are inode blocks, directory blocks, or blocks that contain free lists.

To deal with file system consistency, most computers have applications that check file system consistency. For example, UNIX has FSCK; Windows has SFC, which you can run whenever you boot the system (especially after a crash).

Two kinds of consistency checks can be made: block consistency and file consistency. To check block consistency, the program builds two tables, each containing a counter for every block, initially set to 0. The counters in the first table keep track of how many times each block is present in some file; the counters in the second table keep track of how many times each block is present in the free list (or free bitmap).
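A small sketch of that block check in C; the two counter arrays stand in for the results of walking every inode and the free list, and the repair actions a real checker such as fsck would take are only described in the comments:

#define NBLOCKS 1024

static int in_files[NBLOCKS];   /* filled in by walking every inode   */
static int in_free[NBLOCKS];    /* filled in by walking the free list */

/* Returns the number of inconsistencies found. A real checker would
 * add a missing block to the free list, copy a duplicated block to a
 * fresh block, and remove extra free-list entries, as fsck does. */
int check_blocks(void)
{
    int problems = 0;
    for (int b = 0; b < NBLOCKS; b++) {
        if (in_files[b] + in_free[b] == 0)
            problems++;         /* missing block: in neither place       */
        else if (in_files[b] > 1)
            problems++;         /* duplicate block: in two or more files */
        else if (in_free[b] > 1)
            problems++;         /* listed twice in the free list         */
        else if (in_files[b] == 1 && in_free[b] == 1)
            problems++;         /* both in use and marked free           */
    }
    return problems;
}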

File System Performance

Disk accesses are much slower than memory accesses, so it is worth looking at the numbers again.

Reading a 32-bit word from memory takes about 10 ns. Reading from a hard disk that transfers at 100 MB/s, each 32-bit word takes about 40 ns, four times slower, and on top of that come 5-10 ms of seek time and other delays. If only one word is needed, memory access is on the order of a million times faster than disk access. Because of this gap, disk optimizations are essential; we discuss several of them below.

The cache

The most common technique to reduce the number of disk accesses is to use block cache or buffer cache. A cache is a series of blocks that logically belong to disk but are actually held in memory for performance reasons.

Various algorithms are used to manage the cache, but a common one is to check each read request to see whether the needed block is already in the cache. If it is, the read can be satisfied without a disk access. If the block is not in the cache, it is first read into the cache and then copied to wherever it is needed; subsequent requests for the same block can then be satisfied from the cache.

The operation of caching is shown below

Because the cache holds many blocks, some way is needed to determine quickly whether a given block is present. The usual way is to hash the device and disk address and look the result up in a hash table. All blocks with the same hash value are chained together on a linked list (much like a HashMap), so the collision chain can be followed to find the block.

When the cache is full and a new block must be brought in, some block has to be evicted from the cache; if the evicted block has been modified since it was read in, it must first be written back to disk.
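A sketch of that lookup path in C; the hash function, the sizes, and the read_block_from_disk() stand-in are illustrative, and eviction is only hinted at in a comment:

#include <stdlib.h>

#define HASH_SIZE  64
#define BLOCK_SIZE 1024

struct buf {
    int dev, block;             /* which disk block this buffer holds  */
    int dirty;                  /* modified since it was read in?      */
    char data[BLOCK_SIZE];
    struct buf *next;           /* next buffer on the same hash chain  */
};

static struct buf *hash_table[HASH_SIZE];

static unsigned hash(int dev, int block)
{
    return (unsigned)(dev * 31 + block) % HASH_SIZE;
}

static void read_block_from_disk(struct buf *b) { (void)b; /* disk driver stand-in */ }

/* Return a cached buffer for (dev, block), reading it from disk on a miss. */
struct buf *bread(int dev, int block)
{
    unsigned h = hash(dev, block);

    for (struct buf *p = hash_table[h]; p != NULL; p = p->next)
        if (p->dev == dev && p->block == block)
            return p;                       /* cache hit: no disk access */

    struct buf *b = calloc(1, sizeof(*b));  /* cache miss: make a buffer */
    if (b == NULL)                          /* (a real cache would evict */
        return NULL;                        /*  an old buffer instead)   */
    b->dev = dev;
    b->block = block;
    read_block_from_disk(b);
    b->next = hash_table[h];                /* insert at head of chain   */
    hash_table[h] = b;
    return b;
}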

Block read-ahead

A second way to improve perceived file system performance is to try to get blocks into the cache before they are needed, raising the hit rate. Many files are read sequentially. When the file system is asked for block k of a file, it does the read and, when finished, checks whether block k + 1 is already in the cache. If it is not, it schedules a read for block k + 1, in the hope that by the time that block is needed it will already be in the cache.

Of course, the block-ahead read policy only applies to files that are actually read sequentially. For randomly accessed files, reading ahead doesn’t work at all. It can even get in the way.

Reducing disk arm motion

Caching and block read-ahead are not the only ways to improve filesystem performance. Another important technique is to place blocks that can be accessed sequentially together, preferably on the same cylinder, thereby reducing the number of disk arm movements. When writing an output file, the file system must allocate disk blocks again and again as required. If a bitmap is used to record free blocks, and the entire bitmap is in memory, it is easy to select the nearest free block to the previous one. If you have a free table, and part of the linked list exists on disk, it is much more difficult to allocate adjacent free blocks.

Disk defragmentation

After the initial installation of the operating system, files are constantly created and erased, resulting in fragmentation of the disk. When a file is created, its blocks are scattered all over the disk, reducing performance. Reclaiming disk blocks after deleting files may cause holes.

Disk performance can be restored by moving files so they are adjacent to each other and putting all (or at least most) of the free space in one or more large contiguous regions. Windows has a program, defrag, that does exactly this, and Windows users should run it regularly, except on SSDs.

Linux file systems (especially ext2 and ext3) generally suffer less from fragmentation than Windows because of the way they choose disk blocks, so manual defragmentation is rarely needed. Moreover, SSDs are not really hurt by fragmentation; in fact, defragmenting an SSD is counterproductive: it does not improve performance, it just wears the drive out, shortening its life.

Let’s look at the I/O process.

I/O devices

What is an I/O device? I/O (input/output) devices are the external hardware through which humans and other systems communicate with a computer. An input device sends data to the computer; an output device receives data from it.

I/O devices can be divided into two types: block devices and character devices.

Block devices

A block device stores information in fixed-size blocks, each with its own address, and data can be read and (optionally) written in units of blocks, sectors, or clusters. Block sizes typically range from 512 to 65536 bytes, and all transfers are in units of one or more entire blocks. The essential property of a block device is that each block can be addressed and read or written independently of all the others. Hard disks, Blu-ray discs, and USB sticks are common block devices.

Block devices generally require fewer pins than character devices.

Disadvantages of block devices

A block device built on a given kind of solid-state memory is slower than byte-addressable access to the same memory, because every read or write must start at the beginning of a block. To read any part of a block, the whole block must be located and read, and the unneeded part discarded. To write part of a block, the whole block must be read into memory, modified, and then written back to the device in its entirety.

Character device

Another type of I/O device is the character device. The character device sends or receives a stream of characters in units of characters, regardless of any block structure. The character device is not addressable and does not have any seek operations. Common character devices are printers, network devices, mice, and most other devices that are different from disks.

Equipment controller

The device controller is the hardware that handles the signals passing between the CPU and a device. The device itself plugs into a connector, and the connector is wired to the device controller. The controller receives data from the attached device and stores it in special-purpose registers and a local buffer inside the controller.

Each device controller has a corresponding device driver; the controller communicates with the operating system through the driver's interface, using interrupts. Device controllers are hardware; device drivers are software.

Memory-mapped I/O

Each controller has several registers to communicate with the CPU. By writing to these registers, the operating system can command the device to send data, receive data, turn the device on or off, and so on. By reading information from these registers, the operating system can know the status of the device, whether it is ready to receive a new command, and so on.

In addition to control registers, many devices have data buffers that the operating system can read and write.

How does the CPU communicate with the device registers and data buffers? Two alternatives exist. In the first, each control register is assigned an I/O port number, an 8- or 16-bit integer. The set of all I/O ports forms an I/O port space, protected so that ordinary user programs cannot access it (only the operating system can). Special I/O instructions are used, such as

IN REG,PORT

The CPU can read the contents of the control register PORT and place the results in the CPU register REG. Similarly, use

OUT PORT,REG

The CPU can write the contents of the REG to the control register. Most early computers, including almost all mainframes, such as the IBM 360 and all its successors, worked this way.

The second method, introduced with the PDP-11, maps all the control registers into the memory address space; this is called memory-mapped I/O.
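A hedged sketch of what memory-mapped I/O looks like from a driver's point of view: the device's registers appear at fixed addresses and are read and written like memory through volatile pointers. The address and register layout below are made up for a hypothetical serial port, not taken from any real device:

#include <stdint.h>

#define UART_BASE   0x10000000u            /* hypothetical device address  */
#define UART_STATUS (*(volatile uint32_t *)(UART_BASE + 0x0))
#define UART_DATA   (*(volatile uint32_t *)(UART_BASE + 0x4))
#define TX_READY    0x1u                   /* "ready for next byte" flag   */

/* Busy-wait until the device can accept a byte, then write it to the data
 * register; no IN/OUT instructions are needed, only ordinary loads and stores. */
void uart_putc(char c)
{
    while ((UART_STATUS & TX_READY) == 0)
        ;                                  /* poll the status register */
    UART_DATA = (uint32_t)(unsigned char)c;
}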

Direct memory access

Whether or not a CPU has memory-mapped I/O, it needs addressing device controllers in order to exchange data with them. The CPU can request data one byte at a time from the I/O controller, but doing so wastes CPU time, so a scheme called Direct Memory Access is often used. For simplicity, let’s assume that the CPU accesses all the devices and memory through a single system bus that connects the CPU, memory, and I/O devices, as shown in the figure below

DMA transfer operation

Modern systems are more complex, but the principle is the same. The operating system can use DMA only if the hardware has a DMA controller, which most systems do. Sometimes the DMA controller is integrated into the disk controller and other controllers, but that design requires a separate DMA controller per device. More commonly, a single DMA controller is available for transfers to multiple devices, often concurrently.

How DMA works

First the CPU programs the DMA controller by setting its registers, so the controller knows what to transfer where. The CPU also issues a command to the disk controller telling it to read data from the disk into its internal buffer and verify the checksum. When valid data are in the disk controller's buffer, DMA can begin.

The DMA controller starts the transfer by issuing a read request over the bus to the disk controller; this is step 2. The read request looks like any other read request, and the disk controller does not know or care whether it came from the CPU or from the DMA controller. Typically the memory address to write to is on the bus address lines, so when the disk controller fetches the next word from its internal buffer, it knows where to write it. The write to memory is another bus cycle; this is step 3. When the write completes, the disk controller sends an acknowledgement signal over the bus to the DMA controller; this is step 4.

The DMA controller then increases the memory address and decreases the number of bytes. If the number of bytes is still greater than 0, steps 2 through 4 are repeated until the byte count becomes 0. At this point, the DMA controller interrupts the CPU and tells it that the transfer is complete.
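A hedged sketch of step 1 above: the CPU programs a hypothetical DMA controller by filling in its registers (memory address, byte count, device, direction) and then setting a "go" bit. Every register name and offset here is invented for illustration; real controllers differ:

#include <stdint.h>

#define DMA_BASE     0x20000000u
#define DMA_ADDR     (*(volatile uint32_t *)(DMA_BASE + 0x00))  /* where in memory  */
#define DMA_COUNT    (*(volatile uint32_t *)(DMA_BASE + 0x04))  /* how many bytes   */
#define DMA_DEVICE   (*(volatile uint32_t *)(DMA_BASE + 0x08))  /* which controller */
#define DMA_CONTROL  (*(volatile uint32_t *)(DMA_BASE + 0x0C))
#define DMA_DIR_READ 0x1u          /* device to memory */
#define DMA_START    0x2u

void dma_start_disk_read(uint32_t phys_addr, uint32_t nbytes, uint32_t disk_id)
{
    DMA_ADDR    = phys_addr;       /* where the data should be written       */
    DMA_COUNT   = nbytes;          /* decremented as words are transferred   */
    DMA_DEVICE  = disk_id;         /* which device controller to talk to     */
    DMA_CONTROL = DMA_DIR_READ | DMA_START;
    /* The DMA controller now issues read requests on the bus by itself and
     * interrupts the CPU when DMA_COUNT reaches zero (steps 2-4 above). */
}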

Interrupts

In a PC architecture, the interrupt structure would look like this

How interrupts happen

When an I/O device has finished the work given to it, it causes an interrupt (assuming the operating system has interrupts enabled). It does so by asserting a signal on a bus line it has been assigned. The interrupt controller chip on the motherboard detects this signal and carries out the interrupt.

Precise interrupts and imprecise interrupts

An interrupt that leaves the machine in a well-defined state is called a precise interrupt. Such an interrupt has four properties:

  • The PC (program counter) is kept in a known place
  • All instructions prior to the instruction to which the PC points have been fully executed
  • None of the instructions after the one the PC points to are executed
  • The execution state of the instruction to which the PC points is known

Interrupts that do not meet these requirements are called imprecise interrupts, and they are a pain in the neck: which instructions have finished and which have not is uncertain, which makes recovery difficult. The figure above illustrates the imprecise case.

I/O Software Principles

Goals of I/O software

Device independence

An important goal of I/O software design is device independence. This means that we can write applications that access any device without specifying a specific device in advance.

Error handling

In addition to device independence, the second important goal of I/O software is error handling. In general, errors should be handled as close to the hardware as possible. If the device controller detects a read error, it should try to correct it itself. If it cannot, the device driver should handle it, for example by retrying the read; many errors are transient. If the driver cannot handle the error either, it passes it up to the layer above. In most cases, the upper layers do not need to know how the lower layers resolved the error.

Synchronous and asynchronous transmission

The third goal that I/O software implements is synchronous and asynchronous (interrupt driven) transmission. Let’s talk about synchronous and asynchronous first.

In synchronous transmission, data is usually sent in blocks or frames, and the sender and receiver share a synchronized clock before the data is transferred. In asynchronous transmission, data is usually sent byte by byte or character by character; no shared clock is needed, but parity bits are added to the data before it is sent. Most physical I/O is asynchronous: the CPU starts the transfer and goes off to do something else until the interrupt arrives, at which point it returns to the transfer.

Buffering

Another issue for I/O software is buffering. Data coming off a device usually does not go straight to its final destination; along the way it passes through a series of checking, verification, and buffering steps.

Sharing and exclusivity

The last problem I/O software must deal with is shared versus dedicated devices. Some I/O devices, such as disks, can be used by many users at the same time without causing problems, but other devices must be dedicated: only one user may use them at a time, and other users must wait until the device is released.

There are three ways to control an I/O device

  • Programmed I/O, where the CPU does all the work by polling the device (a polling sketch follows this list)
  • Interrupt-driven I/O
  • DMA-driven I/O
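A minimal sketch of the first approach, programmed (polled) I/O: the CPU busy-waits on a status register until the device is ready and then moves one byte at a time. The register addresses and the READY bit are hypothetical.

```c
#include <stdint.h>

#define DEV_STATUS ((volatile uint32_t *)0x40002000u) /* hypothetical status register */
#define DEV_DATA   ((volatile uint32_t *)0x40002004u) /* hypothetical data register   */
#define STATUS_READY 0x1u

/* Programmed I/O: the CPU does all the work and busy-waits, which is
 * exactly why interrupt-driven I/O or DMA is usually preferred. */
void pio_write(const uint8_t *buf, int n)
{
    for (int i = 0; i < n; i++) {
        while ((*DEV_STATUS & STATUS_READY) == 0)
            ;                      /* spin until the device can accept a byte */
        *DEV_DATA = buf[i];
    }
}
```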

I/O hierarchy

I/O software is typically organized into four layers, whose general structure is shown in the following figure

Let’s explore the above hierarchy in detail

Interrupt handler

Interrupts happen all the time in a computer system, and they are often unpleasant to deal with. Interrupt handlers, also known as interrupt service routines (ISRs), are the layer closest to the hardware. An interrupt handler is started by a hardware interrupt, a software interrupt, or a software exception, and it is what device drivers and protected-mode transitions such as system calls are built on.

Typically, the driver that started an I/O operation blocks until the operation completes and the interrupt arrives. The interrupt handler does whatever is needed to service the interrupt and then unblocks the driver that started it. Depending on the implementation, the notification usually takes one of three forms (a small user-space analogy of the semaphore case follows this list):

  • Semaphore implementation: perform an up on a semaphore that the driver is waiting on
  • Monitor implementation: signal a condition variable that the driver is waiting on
  • In other cases, send a message to the blocked driver
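As a user-space analogy of the semaphore variant (not real kernel code), the sketch below uses POSIX threads and semaphores: the "driver" thread blocks on sem_wait after starting an I/O, and the "interrupt handler" thread calls sem_post to wake it.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t io_done;               /* signalled when the "interrupt" fires */

static void *driver(void *arg)
{
    (void)arg;
    printf("driver: start I/O, block until completion\n");
    sem_wait(&io_done);             /* block, like a driver waiting for its interrupt */
    printf("driver: woken up, finish the request\n");
    return NULL;
}

static void *interrupt_handler(void *arg)
{
    (void)arg;
    sleep(1);                       /* pretend the device worked for a while */
    sem_post(&io_done);             /* the "up" that unblocks the driver     */
    return NULL;
}

int main(void)
{
    pthread_t d, irq;
    sem_init(&io_done, 0, 0);
    pthread_create(&d, NULL, driver, NULL);
    pthread_create(&irq, NULL, interrupt_handler, NULL);
    pthread_join(d, NULL);
    pthread_join(irq, NULL);
    sem_destroy(&io_done);
    return 0;
}
```

Compile with -pthread; inside a real kernel the same idea is implemented with the kernel's own synchronization primitives.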

Device driver

Each I/O device attached to a computer needs some device-specific code to control it. The device-specific code that drives an I/O device by talking to its device controller is called the device driver.

The main functions of the device controller are as follows

  • Receiving and recognizing commands: the device controller accepts and recognizes commands from the CPU, and it contains registers that hold these commands and their parameters

  • Data exchange: data is exchanged between the CPU and the controller, and between the controller and the device. The CPU writes commands and data to the controller's registers over the bus, and the controller passes the data on to the device (or the reverse for reads).

  • Address recognition: every hardware device has its own address, and the device controller recognizes these addresses so that the right device is controlled. In addition, so that the CPU can read and write them, the controller's registers must each have a unique address.

  • Error detection: the device controller can also detect errors in the data transferred to and from the device.

In many cases the device driver must wait for the controller to finish its work, so it blocks until an interrupt arrives to unblock it. In other cases the operation completes without delay, so the driver does not need to block at all. In the first case the driver is woken up by the interrupt; in the second it never goes to sleep.

Device drivers must be reentrant, because a running driver may block, be woken up, and block again. Drivers are not allowed to make system calls, but they usually do need to interact with the rest of the kernel.

Device-independent I/O software

I/O software falls into two groups: the device-specific software (the drivers) described above, and device-independent software, which does not depend on any particular device. Exactly where the boundary between drivers and device-independent software lies varies from system to system. The functions typically implemented by device-independent software are shown below

The basic function of device-independent software is to perform common I/O functions for all devices and provide a unified interface to user-layer software.

Buffering

Buffering is an important consideration for both block devices and character devices. Buffering is widely used, but it has a downside: if data is buffered too many times, performance suffers.

Error handling

In I/O, errors are a perfectly normal occurrence. When they occur, the operating system must handle them as best it can. Some errors can only be handled by a particular device; others are handled in a device-independent way by the software above the drivers.

One class of I/O errors is programming errors, such as reading from a stream whose file was never opened, or never closing a stream and leaking the resource; such problems are for the programmer to fix. The other class is actual I/O errors, such as writing to a disk block that has gone bad: no matter how many times the write is retried, it fails. These are handled by the driver, and if the driver cannot handle them they are passed up to the layers above, as mentioned earlier.

Unified interface for device drivers

As we said in the overview of operating systems, one very important function of the operating system is to hide the differences between pieces of hardware and software and present a uniform standard. The same applies to device drivers: different hardware vendors write different drivers, and if every driver exposed its own interface to the kernel, the kernel could not work with them all, so the driver interfaces must be unified.

Allocate and release

Some devices, such as printers, can be used by only one process at a time, so the operating system must handle requests for them according to the situation. A simple, direct approach is to require processes to perform an open on the special file representing the device; if the device is unavailable, the open fails. Another approach is not to fail immediately but to block, and to complete the open once the process holding the device releases it. This leaves the choice of whether to wait up to the user.

Device-independent block size

Different disks have different sector sizes, but higher-level software should not have to care; the device-independent layer can present a single uniform block size, for example by treating several sectors as one logical block. Some character devices deliver data one byte at a time while others deliver it in larger units, and these differences can be hidden as well.

User space I/O software

While most I/O software lives in the kernel, some of it is implemented in user space; there is no hard rule. Some I/O software consists of library procedures linked with user programs, which ultimately do their work through system calls.

Disks

The disk is arguably one of the simplest pieces of hardware in structure, and also one of the most important. Let's start with the disk and look at its physical structure

Disk hardware

There are many types of disks. The simplest are magnetic hard disks, also known as hard drives or HDDs. A disk is paired with magnetic heads mounted on an arm that read and write data, and reads and writes are roughly equally fast. On a disk, data is randomly accessed: individual blocks can be stored and retrieved in any order, so the heads can be positioned over any part of the disk. Disks are non-volatile devices; their data is retained even after power is lost.


To organize and retrieve data, disks are organized into specific structures, namely tracks, sectors, and cylinders

A disk consists of one or more platters mounted on a common spindle. The set of tracks at one arm position across all platters forms a cylinder, and each track is divided into sectors. Floppy disks have about 8-32 sectors per track, hard disks can have several hundred sectors per track, and there are typically 1-16 heads.

An important property of a disk controller is whether it can seek on two or more drives at the same time, known as overlapped seeks: the controller can start a seek on one drive while waiting for a seek on another to finish. Many controllers can also read or write on one drive while seeking on others, but a floppy disk controller cannot read or write on two drives at once.

RAID

RAID stands for Redundant Array of Independent Disks, or disk array for short. It uses virtualization to combine multiple physical disks into one or more arrays in order to improve performance or provide data redundancy.

RAID has different levels

  • RAID 0 – Striping (no fault tolerance)
  • RAID 1 – Mirroring and duplexing
  • RAID 2 – Memory-style error-correcting codes
  • RAID 3 – Bit-interleaved parity
  • RAID 4 – Block-interleaved parity
  • RAID 5 – Block-interleaved distributed parity
  • RAID 6 – P + Q redundancy

Disk formatting

Disks are made of aluminum, alloy, or glass platters. When first manufactured they contain no information at all; before a disk can be used it must be given a low-level format. Here is the format of a sector

The preamble marks the start of the sector. It begins with a particular bit pattern and also contains information such as the cylinder and sector numbers. The preamble is followed by the data area, whose size is set by the low-level formatting program; most disks use 512-byte sectors. After the data area comes the ECC (error-correcting code) field. Unlike simple error detection, an ECC can also be used to recover from read errors. The size of the ECC field varies among manufacturers; it depends on how much disk space the designer is willing to sacrifice for reliability and on how complex an ECC the controller can handle. A 16-byte ECC is common. In addition, hard disks reserve a number of spare sectors that are used to replace sectors with manufacturing defects.
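The layout just described can be pictured as a simple struct. The field sizes below (a made-up 8-byte preamble, a 512-byte data area, a 16-byte ECC) are only illustrative, since the real sizes are chosen by the disk manufacturer.

```c
#include <stdint.h>

/* Illustrative on-disk sector layout; real preamble contents and
 * ECC width are manufacturer-specific. */
struct sector {
    uint8_t  preamble_pattern[4]; /* bit pattern marking the start of the sector */
    uint16_t cylinder;            /* cylinder number, part of the preamble       */
    uint16_t sector_no;           /* sector number, part of the preamble         */
    uint8_t  data[512];           /* data area, typically 512 bytes              */
    uint8_t  ecc[16];             /* error-correcting code                       */
};
```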

Disk arm scheduling algorithm

Let's now look at the factors that determine how fast a disk can be read or written. In general, the time to read or write a disk block is determined by the following

  • Seek time – the time to move the disk arm to the proper cylinder
  • Rotational delay – the time to wait for the right sector to rotate under the head
  • The actual data transfer time

Together these make up the disk access time. The seek time usually dominates, so reducing the average seek time can substantially improve read and write performance.

If the disk driver accepts requests one at a time and completes them in arrival order, that is first-come, first-served (FCFS), and little can be done to optimize the seek time: requests are processed in whatever order they arrive, so the arm may wait a full rotation to serve one request while requests on other cylinders could have been served immediately; every request simply waits its turn in the queue.

Normally, while the disk is seeking on behalf of one request, other processes issue additional disk requests. Many disk drivers therefore maintain a table indexed by cylinder number, with the pending requests for each cylinder chained together in a linked list headed by the table entry.

An alternative to first-come, first-served is the Shortest Seek First (SSF) algorithm, described below.

Suppose that while the arm is on track 6, requests arrive for tracks 11, 2, 4, 14, 8, 15, and 3. With first-come, first-served, the arm moves as shown in the figure below

We can calculate the number of tracks the arm crosses: 5 + 9 + 2 + 10 + 6 + 7 + 12 = 51. With Shortest Seek First we can do the same calculation

The number of tracks crossed is then 2 + 1 + 1 + 6 + 3 + 3 + 1 = 17, roughly a third of the 51 crossings needed by first-come, first-served.

However, Shortest Seek First is not perfect: it has a fairness problem. With a heavily loaded disk, new requests tend to arrive near the arm's current position, so the arm stays around the middle of the disk and requests for cylinders near the edges may be postponed for a long time.

A useful model here is the elevator in everyday life, which uses the elevator algorithm to reconcile the conflicting goals of efficiency and fairness: it keeps moving in one direction until there are no more requests in that direction, and only then reverses.

The elevator algorithm maintains one bit, the current direction: UP or DOWN. After finishing a request, the disk or elevator driver checks this bit. If it is UP, the arm (or elevator car) moves to the next pending request at a higher position; if there are no pending requests above, the direction bit is reversed. When the bit is DOWN, the arm moves to the next pending request at a lower position, and if there is none it simply stops and waits.

Let’s take an example to describe the elevator algorithm. For example, each cylinder is served in the order of 4,7,10,14,9,6,3,1. Then its flow chart is as follows

So the number of cylinders the elevator algorithm crosses is 3 + 3 + 4 + 5 + 3 + 3 + 1 = 22

In terms of total arm motion, the elevator algorithm is usually slightly worse than SSF, but it is fairer.
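A minimal sketch of the elevator (SCAN) idea, assuming pending requests are tracked in a boolean array indexed by cylinder; the request set, starting cylinder, and initial direction in main are made up for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Service all pending requests with the elevator algorithm and return
 * the total number of cylinders the arm crosses.
 * pending[i] is true if cylinder i has an outstanding request. */
int elevator(bool pending[], int ncyl, int pos, bool up)
{
    int moved = 0, left = 0;

    for (int i = 0; i < ncyl; i++)           /* count outstanding requests */
        if (pending[i]) left++;

    while (left > 0) {
        int next = -1;
        if (up) {                            /* nearest request at or above pos */
            for (int i = pos; i < ncyl; i++)
                if (pending[i]) { next = i; break; }
        } else {                             /* nearest request at or below pos */
            for (int i = pos; i >= 0; i--)
                if (pending[i]) { next = i; break; }
        }
        if (next < 0) { up = !up; continue; }   /* nothing ahead: reverse */

        moved += next > pos ? next - pos : pos - next;
        pos = next;
        pending[pos] = false;
        left--;
    }
    return moved;
}

int main(void)
{
    bool pending[16] = { false };
    int requests[] = { 2, 5, 9, 12 };        /* hypothetical request set */
    for (unsigned i = 0; i < sizeof requests / sizeof requests[0]; i++)
        pending[requests[i]] = true;

    printf("arm movement: %d cylinders\n", elevator(pending, 16, 7, true));
    return 0;
}
```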

Error handling

Disks develop bad blocks, and there are generally two ways of dealing with them: handle them in the controller, or handle them at the operating system level.

These two methods are often used interchangeably, such as on a disk with 30 data sectors and two spare sectors, where sector 4 is defective.

One thing the controller can do is remap the defective sector to one of the spare sectors.

Another approach is to shift all subsequent sectors up by one, skipping over the defective one

In both cases the controller must know which sector is which. It can track this information in internal tables, or it can rewrite the preambles to carry the remapped sector numbers. If the preambles are rewritten, the shifting approach requires rewriting the preambles of all the following sectors, but it ultimately gives better performance.

Stable storage

Disk errors do happen: a good sector can suddenly turn into a bad one, and an entire drive can fail. RAID protects against bad sectors and drive crashes, but it does not protect against writes that store bad data in the first place, nor against a crash during a write that corrupts the original data.

We would like disks to work perfectly, which they do not. What we can build is a disk subsystem with the following property: when a write is issued to it, the disk either writes the data correctly or does nothing, leaving the existing data intact. Such a system is called stable storage, and its goal is to keep the disk consistent at all costs.

Stable storage uses a pair of identical disks; the corresponding blocks on the two disks together form one error-free logical block. Stable storage defines three operations:

  • Stable Write
  • Stable read
  • Crash Recovery

The clock

Clocks, also called timers, are essential to any multiprogrammed system. Among other things, the clock maintains the time of day and prevents one process from monopolizing the CPU. Clock software takes the form of a device driver. As with the other devices, we discuss the hardware first and then the software, working bottom-up, because the lowest layer is the foundation.

The clock hardware

Two types of clocks are used in computers, and both are quite different from the clocks and watches used in everyday life.

  • The simpler kind of clock is tied to the 110 V or 220 V power line and generates an interrupt on every voltage cycle, at 50-60 Hz. These clocks used to dominate.
  • The other kind is built from a crystal oscillator, a counter, and a holding register, as shown in the diagram below

This kind of clock is called a programmable clock, and it has two modes. In one-shot mode, when the clock is started it copies the value of the holding register into the counter; each pulse of the crystal oscillator then decrements the counter, and when the counter reaches 0 it generates an interrupt and stops until the software explicitly restarts it. In square-wave mode, when the counter reaches 0 and the interrupt occurs, the holding register is automatically copied into the counter again, so the whole process repeats. These periodic interrupts are called clock ticks.

The clock software

All the clock hardware does is generate interrupts at known intervals; everything else is done by the clock software. The duties of the clock software differ from one operating system to another, but they generally include the following (a small sketch of a tick handler follows the list)

  • Maintaining the time of day
  • Preventing processes from running longer than their allotted time
  • Collecting statistics about CPU usage
  • Handling the alarm system call made by user processes
  • Providing watchdog timers for parts of the system itself
  • Doing profiling, monitoring, and statistics gathering
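A hedged sketch of what a tick handler might do for the first three duties. The names (ticks, current, quantum) and the policy are invented for illustration and are not any particular kernel's code.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy process descriptor; the field names are invented for illustration. */
struct process {
    int quantum;            /* remaining clock ticks in this time slice */
    uint64_t cpu_ticks;     /* accumulated CPU usage, for accounting    */
};

static volatile uint64_t ticks;  /* time of day, measured in clock ticks       */
static struct process *current;  /* process that was running when the tick hit */

/* Called on every clock interrupt. */
void clock_tick(void)
{
    ticks++;                        /* 1. maintain the time of day       */
    current->cpu_ticks++;           /* 3. collect CPU usage statistics   */

    if (--current->quantum <= 0) {
        /* 2. the process used up its slice; a real kernel would call the
         * scheduler here to pick another process to run. */
    }
}

int main(void)
{
    struct process p = { .quantum = 3, .cpu_ticks = 0 };
    current = &p;
    for (int i = 0; i < 5; i++)
        clock_tick();
    printf("ticks=%llu, quantum left=%d\n",
           (unsigned long long)ticks, p.quantum);
    return 0;
}
```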

Soft timer

A programmable clock can be set to trigger interrupts at whatever rate a program needs. Interrupts triggered this way are hardware interrupts, and for some applications the overhead of frequent hardware interrupts is unacceptable.

In that case soft timers can be used to avoid interrupts. Whenever the kernel is running for some other reason, it checks, just before returning to user mode, whether any soft timer has expired. If one has, the scheduled work is done on the spot; since execution is already in kernel mode, no extra mode switch is needed. This avoids frequent switching between kernel mode and user mode and improves efficiency.

How often the kernel is entered, and therefore how often soft timers get checked, varies with the workload; the kernel is entered for reasons such as

  • System calls
  • TLB misses
  • Page faults
  • I/O interrupts
  • The CPU going idle

Deadlock is another very important problem in operating systems.

Resources

Most deadlocks are related to resources: they occur when processes have been granted exclusive access to devices or files. We call such objects, which must be used exclusively, resources. Resources come in two kinds: preemptable and nonpreemptable.

Preemptable and nonpreemptable resources

A preemptable resource can be taken away from the process that owns it without any ill effect; memory is an example of a preemptable resource.

A nonpreemptable resource cannot be taken away from the process that currently owns it without causing an error. A CD-ROM recorder, for example, cannot be taken from the process using it and handed to another process in the middle of its work.

A deadlock

If there is a definition for deadlocks, the following definition would be appropriate

A set of processes is deadlocked if each process in the set is waiting for an event that only another process in the set can cause.

Conditions for resource deadlocks

Following from the description above, a resource deadlock can occur only when the following conditions hold

  • Mutual exclusion: each resource is either assigned to exactly one process or is available
  • Hold and wait: processes that already hold resources may request new resources
  • No preemption: a resource granted to a process cannot be forcibly taken away from it; it can only be released by the process holding it
  • Circular wait: there is a circular chain of two or more processes, each waiting for a resource held by the next member of the chain

All four conditions must hold simultaneously for a deadlock to occur; if any one of them is absent, no deadlock is possible. Deadlocks can therefore be attacked by breaking any one of these conditions, and that is the focus of the discussion below.

Deadlock model

Holt showed in 1972 how deadlocks can be modeled with directed graphs, using the following conventions:

  • A circle represents a process
  • A square represents a resource

An arc from a resource node to a process node means that the resource has been assigned to that process, as shown in the following figure

In the figure above, resource R is being used by process A

An arc from a process node to a resource node means that the process is requesting the resource and is blocked waiting for it

In the figure above, process B is requesting resource S. According to Holt, a deadlock then looks like this

This is a deadlock: process C is waiting for resource T to be released, but T is held by process D; process D in turn is requesting resource U, which is held by process C. Together they form a cycle.

There are four strategies for dealing with deadlocks:

  • Ignore the problem altogether (the ostrich algorithm)
  • Detection and recovery: let deadlocks occur, detect them, and take action once they are found
  • Avoidance, by allocating resources carefully
  • Prevention, by structurally breaking one of the four conditions that cause deadlocks

Let’s take a look at each of the four methods

The ostrich algorithm

The simplest approach is the ostrich algorithm: stick your head in the sand and pretend the problem does not exist. People react to this differently. Mathematicians find deadlocks unacceptable and insist they must be prevented by effective strategies. Engineers ask how often the problem occurs, how often the system crashes for other reasons, and how serious the consequences of a deadlock are. If deadlocks occur rarely while the system crashes far more often because of hardware failures, compiler bugs, and other operating system problems, most engineers will not go out of their way to fix deadlocks.

Deadlock detection and recovery

The second technique is deadlock detection and recovery. This approach does not try to prevent deadlocks; instead, it lets them happen, tries to detect them when they do, and then takes action to recover. Let's look at several ways of detecting and recovering from deadlocks.

Deadlock detection for one resource of each type

What does one resource of each type mean? It means there is exactly one instance of each resource; a system with one printer is the usual example: the resource exists, but there is only one of it.

Such deadlocks can be detected by constructing the resource graph we described above

If the graph contains one or more cycles, a deadlock exists, and any process that is part of a cycle is deadlocked.
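For this single-instance case, detection boils down to finding a cycle in the directed graph. Here is a minimal depth-first-search sketch over an adjacency matrix; the wait-for graph in main is made up for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

#define NODES 4

/* DFS cycle detection: a node that is reached again while it is still on
 * the current DFS path closes a cycle, i.e. indicates a deadlock. */
static bool dfs(bool g[NODES][NODES], int u, bool on_path[], bool done[])
{
    on_path[u] = true;
    for (int v = 0; v < NODES; v++) {
        if (!g[u][v]) continue;
        if (on_path[v]) return true;                      /* back edge: cycle */
        if (!done[v] && dfs(g, v, on_path, done)) return true;
    }
    on_path[u] = false;
    done[u] = true;
    return false;
}

int main(void)
{
    /* Made-up wait-for graph: 0 -> 1 -> 2 -> 0 form a cycle. */
    bool g[NODES][NODES] = { {0,1,0,0}, {0,0,1,0}, {1,0,0,0}, {0,0,1,0} };
    bool on_path[NODES] = { false }, done[NODES] = { false };

    for (int i = 0; i < NODES; i++)
        if (!done[i] && dfs(g, i, on_path, done)) {
            printf("deadlock: cycle found\n");
            return 0;
        }
    printf("no deadlock\n");
    return 0;
}
```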

Deadlock detection for multiple resources of each type

If there are multiple instances of some resources, a different method is needed: a matrix-based algorithm can detect deadlocks among n processes P1 through Pn.

Assume there are m resource types, with E1 units of resource type 1, E2 units of type 2, and in general Ei units of resource type i (1 <= i <= m). E is the existing-resource vector: it gives the total number of units of each resource type that exist.

We also need two matrices: C, the current allocation matrix, and R, the request matrix. The i-th row of C records how many units of each resource class process Pi currently holds, so Cij is the number of units of resource j held by Pi, and Rij is the number of units of resource j that Pi wants.

For every resource type j, the units that have been allocated plus the units that are still available add up to the total number of units of that resource.
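Writing Aj for the number of units of resource j that are currently available (unassigned), this invariant reads, in the notation above:

C1j + C2j + … + Cnj + Aj = Ej,  for every resource type j (1 <= j <= m)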

Deadlock detection is based on comparing vectors. Initially every process is unmarked. The algorithm repeatedly looks for an unmarked process whose outstanding requests can all be satisfied by the available resources; such a process can run to completion, so it is marked and the resources it holds are returned to the available pool. When no more processes can be marked, the algorithm terminates, and any process that is still unmarked is deadlocked.
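A minimal sketch of that marking algorithm, assuming small fixed-size matrices: a process is marked as able to finish when its entire request row fits within the available vector, at which point its allocation is reclaimed. The matrices in main are made up.

```c
#include <stdbool.h>
#include <stdio.h>

#define N 3   /* processes      */
#define M 4   /* resource types */

/* Returns true (and prints the culprits) if some processes are deadlocked. */
bool detect_deadlock(int C[N][M], int R[N][M], int A[M])
{
    bool marked[N] = { false };
    bool progress = true;

    while (progress) {
        progress = false;
        for (int i = 0; i < N; i++) {
            if (marked[i]) continue;

            bool satisfiable = true;          /* is R[i] <= A componentwise? */
            for (int j = 0; j < M; j++)
                if (R[i][j] > A[j]) { satisfiable = false; break; }

            if (satisfiable) {                /* P_i can run to completion,  */
                for (int j = 0; j < M; j++)   /* so reclaim what it holds    */
                    A[j] += C[i][j];
                marked[i] = true;
                progress = true;
            }
        }
    }

    bool deadlock = false;
    for (int i = 0; i < N; i++)
        if (!marked[i]) { deadlock = true; printf("P%d is deadlocked\n", i); }
    return deadlock;
}

int main(void)
{
    /* Made-up example: current allocation, outstanding requests, available. */
    int C[N][M] = { {0,0,1,0}, {2,0,0,1}, {0,1,2,0} };
    int R[N][M] = { {2,0,0,1}, {1,0,1,0}, {2,1,0,0} };
    int A[M]    = { 2, 1, 0, 0 };

    if (!detect_deadlock(C, R, A))
        printf("no deadlock\n");
    return 0;
}
```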

Above we discussed two ways of detecting deadlocks. Now that you know how to detect them, when should detection actually be run? In general, there are two options:

  • Check every time a resource request is made. This catches deadlocks as early as possible, but can consume a lot of CPU time.
  • Check every k minutes, or whenever CPU utilization drops below some threshold. The CPU is worth watching because when enough processes are deadlocked there are few runnable processes left, so the CPU is frequently idle.

Recovering from a deadlock

We have discussed how to detect deadlocks, but the ultimate goal is to keep programs running normally, so a detected deadlock must be recovered from. Let's discuss several ways of recovering from a deadlock.

Recovery through preemption

In some cases it is possible to temporarily take a resource away from its current holder and give it to another process, sometimes without even notifying the original process. Recovering this way is generally difficult and rather crude, and it is often not a desirable option.

Restore by rolling back

If the system designers and operators know that deadlocks are likely, they can arrange for processes to be checkpointed periodically. Checkpointing a process means writing its state to a file so that it can be restarted later; the checkpoint contains not only the memory image but also the resource state. To be useful, new checkpoints should not overwrite earlier ones but be written to separate files, so that a whole sequence of checkpoints accumulates as the process runs.

To recover, a process that holds a needed resource is rolled back to an earlier checkpoint, taken before it acquired that resource; the resource can then be given to one of the deadlocked processes.

Kill process recovery

The simplest and crudest way to break a deadlock is to kill one of the deadlocked processes. If killing one process is not enough, another must be killed, and so on until the cycle is broken.

Another option is to choose a process outside the cycle as the victim, in order to free the resources the deadlocked processes need.

Deadlock avoidance

We have discussed how to detect deadlocks and how to recover from them; now let's discuss how to avoid them in the first place.

The banker's algorithm for a single resource

The banker's algorithm is a scheduling algorithm proposed by Dijkstra in 1965; it is itself a deadlock-avoidance algorithm. It is modeled on a small-town banker who has promised lines of credit to a number of customers. All the algorithm does is decide whether granting a request would lead to an unsafe state: if it would, the request is deferred; if the system remains safe after granting it, the request is granted.
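A minimal sketch of the single-resource safety check, assuming each customer's maximum credit line and current loan are known; a state is safe if there is some order in which every customer can still be satisfied. The figures in main are made up.

```c
#include <stdbool.h>
#include <stdio.h>

#define N 4

/* Safety check of the banker's algorithm for one resource type.
 * has[i]     = units already lent to customer i
 * max[i]     = maximum units customer i may ever need
 * free_units = units still held by the banker                   */
bool is_safe(const int has[N], const int max[N], int free_units)
{
    bool done[N] = { false };
    int finished = 0;

    while (finished < N) {
        bool progress = false;
        for (int i = 0; i < N; i++) {
            if (!done[i] && max[i] - has[i] <= free_units) {
                free_units += has[i];   /* customer finishes and repays the loan */
                done[i] = true;
                finished++;
                progress = true;
            }
        }
        if (!progress)
            return false;               /* nobody can finish: unsafe state */
    }
    return true;
}

int main(void)
{
    /* Made-up state: grant a request only if the resulting state is safe. */
    int has[N] = { 1, 1, 2, 4 }, max[N] = { 6, 5, 4, 7 };
    int free_units = 2;

    printf("state is %s\n", is_safe(has, max, free_units) ? "safe" : "unsafe");
    return 0;
}
```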

There is also a banker's algorithm for multiple resource types, which readers can study on their own.

Break the deadlock

Deadlock avoidance in general is essentially impossible, because it requires knowing future requests, which are unknown. A deadlock, however, requires all four of the following conditions to hold

  • Mutual exclusion
  • Hold and wait
  • No preemption
  • Circular wait

We now discuss these four conditions one by one; it stands to reason that breaking any one of them prevents deadlock.

Break the mutex condition

Our first target is the mutual exclusion condition. If no resource were ever assigned exclusively to a single process, deadlocks over it could never occur; but letting two processes write to the printer at the same time leads to chaos. The printer's solution is spooling, a technique that lets several processes generate output at the same time. In this model, the only process that actually requests the physical printer is the printer daemon, a background process. Since the daemon requests no other resources, deadlock over the printer is eliminated.

The daemon is usually programmed to begin printing only after a complete output file is available, so if two processes each fill half of the spooling space and neither can finish producing its output, a deadlock can still result.

The general lesson, therefore, is to have as few processes as possible actually claim the resource.

Break the hold-and-wait condition

The second approach: if we can prevent processes that hold resources from waiting for more resources, we can eliminate deadlocks. One way is to require every process to request all its resources before it starts executing. If everything it needs is available, the resources are allocated and the process runs to completion; if any resource is busy, nothing is allocated and the process simply waits.

An immediate problem is that many processes do not know how many resources they will need until they have started running; if they did know, the banker's algorithm could be used. Another problem is that this approach does not use resources efficiently.

A slightly different approach: when a process needs another resource, it first releases all the resources it currently holds and then tries to acquire everything it needs at once.

Break the non-preemption condition

The non-preemption condition can also be attacked. For some resources this can be done by virtualizing them, as with the printer spooling described above.

Break the circular wait condition

Only one condition is left. The circular wait can be eliminated in several ways. One is a rule saying that a process may hold only one resource at any moment; if it needs a second resource, it must first release the one it holds. For a process that needs to copy a large file from tape to a printer, however, this restriction is unacceptable.

Another option is to uniformly number all the resources, as shown in the figure below

Processes may make requests whenever they like, but all requests must be made in numerical order of the resources (see the sketch below). If every process follows this rule, the resource allocation graph can never contain a cycle.
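As a user-level illustration of the numbering idea, assuming two pthread mutexes standing in for two numbered resources: every thread acquires them in the same global order, so no circular wait can arise.

```c
#include <pthread.h>
#include <stdio.h>

/* Global ordering: lock_a (resource #1) is always taken before lock_b (#2). */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *name)
{
    /* Every thread follows the same order, so a cycle cannot form. */
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    printf("%s holds both resources\n", (const char *)name);
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, "thread 1");
    pthread_create(&t2, NULL, worker, "thread 2");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```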

Although deadlocks are eliminated this way, it may be impossible to find a numbering that suits every process.

Other problems

Let's look at a few other issues: communication deadlocks, livelock, starvation, and two-phase locking.

Two-phase locking

Although avoidance and prevention handle many cases, they do not work well in every situation, and over time specialized approaches have been developed for particular settings. For example, in many database systems a common operation is to lock several records and then update them all; when multiple processes run at the same time, there is a real risk of deadlock.

One solution is two-phase locking. As the name implies, there are two phases. In the first phase the process tries to lock all the records it needs, one at a time. If it succeeds, the second phase begins: the updates are performed and the locks are released. No real work is done in the first phase.

If, during the first phase, some record the process needs is already locked, the process releases all the locks it has obtained and starts the first phase over. In a sense this is similar to requesting all resources up front, or at least before doing anything irreversible.

In general application scenarios, however, two-phase locking is not universally applicable: it is often unacceptable for a process to be broken off in the middle and restarted just because a resource it needs is unavailable.
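A user-space sketch of the first phase, under the assumption that the records are protected by pthread mutexes: try to lock everything; on any failure, release whatever was obtained and start the phase again.

```c
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NRECORDS 3
static pthread_mutex_t record_lock[NRECORDS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Phase 1: lock every needed record, or back off completely and retry. */
static void lock_all(const int *needed, int n)
{
    for (;;) {
        int got = 0;
        while (got < n && pthread_mutex_trylock(&record_lock[needed[got]]) == 0)
            got++;
        if (got == n)
            return;                         /* phase 1 succeeded            */
        while (got-- > 0)                   /* back off: release everything */
            pthread_mutex_unlock(&record_lock[needed[got]]);
        sched_yield();                      /* give others a chance, retry  */
    }
}

static void unlock_all(const int *needed, int n)
{
    while (n-- > 0)
        pthread_mutex_unlock(&record_lock[needed[n]]);
}

int main(void)
{
    int needed[] = { 0, 2 };
    lock_all(needed, 2);
    printf("phase 2: update the locked records, then release them\n");
    unlock_all(needed, 2);
    return 0;
}
```

Note that this constant backing off and retrying is exactly the pattern that can turn into the livelock described below.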

Communication deadlock

We have been talking about resource deadlocks, which are the most common kind, but not the only kind. Communication deadlocks occur between two or more processes that communicate by sending messages. For example, process A sends a message to process B and then blocks until B returns a reply. If the request message is lost, A is stuck waiting for the reply while B is stuck waiting for a request to arrive: a deadlock.

Although this is a deadlock, it is not a resource deadlock, because A is not holding any resource that B wants; in fact, no visible resource is involved at all. Yet by the definition of deadlock, each process is blocked waiting for an event that only the other can cause, so it is a deadlock. To distinguish it from the more common resource kind, this situation is called a communication deadlock.

Communication deadlocks cannot be avoided by careful scheduling, but they can be avoided with a concept used throughout communication systems: timeouts. Whenever a message is sent, the sender starts a timer recording how long to wait; if the timer expires before a reply arrives, the sender assumes the message was lost and sends it again. In this way, communication deadlocks are avoided.

However, not every deadlock in a network is a communication deadlock; resource deadlocks exist there too. The following is a typical resource deadlock.

When a packet arrives at a router from a host, it is placed in a buffer and then forwarded to another router, and another, until it reaches its destination. Buffers are resources, and their number is limited. As shown in the figure below, suppose each router has 10 buffers (in practice there are many more).

Suppose all of router A's packets must be sent to router B, all of router B's packets must be sent to router D, and all of router D's packets must be sent back to router A. No packet can move, because there is no free buffer at the receiving end: a classic resource deadlock.

Livelock

In some situations, when a process realizes that it cannot acquire the next lock it needs, it politely releases the locks it already holds and then retries a moment later. Picture two people meeting in a narrow alley, each politely stepping aside for the other: if they keep stepping the same way at the same moment, neither ever gets past.

Now imagine a pair of processes each needing two resources. Each acquires one lock, fails to get the other, releases what it holds, and tries again, over and over, perfectly in step. Clearly no process is blocked, yet neither process makes any progress. We call this condition livelock.

Starvation

A problem closely related to deadlock and livelock is starvation. When do you get hungry? When you have not eaten for a while. For processes, the food is resources: if a process never gets the resources it needs, it starves, which is to say it is never served.

Suppose, for instance, that the printer is always allocated to the process with the smallest file to print. Then a process with a huge file may never get the printer: it starves and is postponed indefinitely, even though it is never blocked.
