This article takes you through the algorithms that appear in operating systems.

Algorithms in process and thread management

Many algorithms are used to schedule processes and threads. The background is that in a multiprogramming system there are frequently many processes or threads competing for the CPU at the same time; this happens whenever two or more of them are in the ready state. If only one CPU is available, a choice must be made about which process or thread runs next. Choosing well is something of an art. The part of the operating system that makes this choice is called the scheduler, and the algorithm it uses is called the scheduling algorithm.

Scheduling algorithm classification

Different kinds of operating systems call for different scheduling algorithms. Operating systems are mainly divided into the following categories:

  • Batch operating system
  • Interactive operating system
  • Real time operating system

Let’s take a look at the algorithms in each of these operating systems.

Algorithms in batch operating systems

Design goals

Batch systems are widely used in business, for example to process payroll, inventory, accounts receivable, accounts payable, interest calculations, claims processing, and other periodic jobs. In batch systems, non-preemptive algorithms, or preemptive algorithms with long time slices, are generally used. This approach reduces process switching and therefore improves performance.

In an interactive environment, a preemptive algorithm is needed so that, for the sake of the user experience, no process can hold up the others for long periods of time. Even without intentional hogging, a faulty process could otherwise shut out all other processes indefinitely. To avoid this, preemption is necessary.

In real-time systems, preemption is sometimes not needed, because the processes know they may not run for long; they usually do their work quickly and block.

Key indicators

Three metrics are usually used to measure how well a batch system is doing: throughput, turnaround time, and CPU utilization.

  • Throughput is the number of jobs the system completes per hour. All things considered, 50 jobs per hour is better than 40.
  • Turnaround time is the average time from the moment a batch job is submitted until it completes. It measures how long the average user has to wait for output; the smaller the turnaround time, the better.
  • CPU utilization is often quoted as a metric on batch systems. Even so, it is not a very good one; the metrics that really matter are how many jobs the system completes per hour (throughput) and how long it takes to complete them (turnaround time).

Let’s look at algorithms in batch processing.

First come, first served

First come, first served is exactly what it sounds like: a non-preemptive algorithm that allocates the CPU to processes in the order they request it. At its core there is a single queue of ready processes. When the first job enters the system from the outside, it is started immediately and allowed to run as long as it wants; it will not be interrupted for running too long. As other jobs come in, they are placed at the end of the ready queue. When the running process blocks, the first process in the queue runs next. When a blocked process becomes ready again, it is treated like a newly arrived job and placed at the end of the queue, behind all the other processes.

The strength of this algorithm is that it is easy to understand and easy to program. A single linked list keeps track of all ready processes. To select a process to run, simply remove one from the head of the queue; to add a new job or an unblocked process, simply append it to the end of the queue. The implementation could hardly be simpler.
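
To give a feel for how little bookkeeping FCFS needs, here is a minimal sketch of the ready queue; the job names are made up for the example:

```python
from collections import deque

ready_queue = deque()                 # FCFS ready queue: the head runs next

def add_job(name):
    ready_queue.append(name)          # new or newly unblocked jobs go to the tail

def pick_next():
    return ready_queue.popleft()      # always run whatever is at the head

for job in ("payroll", "inventory", "claims"):   # made-up job names
    add_job(job)
print(pick_next(), pick_next())       # payroll inventory
```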

The weakness is also easy to see. If 100 I/O-bound processes are queued and the 101st is CPU-bound, the CPU-bound process has to wait for the 100 I/O-bound processes ahead of it before it gets to run. This is undesirable in practice, so priorities or preemption are needed to let important processes run first.

Shortest job first

The second scheduling algorithm for batch systems is shortest job first, which assumes that the running times are known in advance. An insurance company, for example, can predict with considerable accuracy how long it will take to process a batch of 1,000 claims, because it does similar work every day. When several equally important jobs are sitting in the input queue waiting to be started, the scheduler should pick the shortest job first.

As shown in figure A, there are four jobs A, B, C, and D, with running times of 8, 4, 4, and 4 minutes respectively. If running in the order shown in the figure, the turnaround time of A is 8 minutes, B is 12 minutes, C is 16 minutes, D is 20 minutes, and the average time is 14 minutes.

Now consider running the four jobs using shortest job first, as shown in Figure B. The turnaround times become 4, 8, 12, and 20 minutes, for an average of 11 minutes. It can be shown that shortest job first is optimal. Consider the general case of four jobs with running times a, b, c, and d. The first job finishes at time a, the second at time a + b, and so on, so the average turnaround time is (4a + 3b + 2c + d) / 4. Clearly a contributes the most to the average, so a should be the shortest job, followed by b, then c, and finally d, which only affects its own turnaround time.

It is important to note that shortest job first is only optimal when all the jobs are available at the same time.
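
As a quick check of the arithmetic above, here is a small sketch that computes the average turnaround time for the example jobs A, B, C, and D in their original order and in shortest-job-first order:

```python
def average_turnaround(run_times):
    """Average turnaround time when jobs run in the given order."""
    elapsed, total = 0, 0
    for t in run_times:
        elapsed += t          # this job finishes at the current elapsed time
        total += elapsed      # its turnaround time equals its finish time
    return total / len(run_times)

jobs = [8, 4, 4, 4]                       # A, B, C, D from the example
print(average_turnaround(jobs))           # 14.0 (original order)
print(average_turnaround(sorted(jobs)))   # 11.0 (shortest job first)
```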

Shortest remaining time next

The preemptive version of shortest job first is known as shortest remaining time next. With this algorithm, the scheduler always selects the process whose remaining running time is shortest. When a new job arrives, its total time is compared with the remaining time of the currently running process; if the new job needs less time, the current process is suspended and the new job is run. In this way, newly arriving short jobs get good service.

Scheduling in interactive systems

Interactive systems are so common on PCs, servers, and other machines that interactive scheduling is worth exploring.

Round-robin scheduling

One of the oldest, simplest, fairest, and most widely used algorithms is round robin. Each process is assigned an interval of time, called its quantum, during which it is allowed to run. If the process is still running when the quantum expires, the CPU is preempted and given to another process; if the process blocks or finishes before the quantum expires, the CPU switches immediately. Round robin is easy to implement: all the scheduler has to do is maintain a list of runnable processes, as in A in the figure below, and move a process to the end of the list when it uses up its quantum, as in B in the figure below.

The really interesting issue with round-robin scheduling is the length of the quantum. Switching from one process to another takes time for administrative work: saving register values and memory maps, updating various tables and lists, and flushing and reloading the memory cache. This switch is called a process switch or context switch.
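
Here is a minimal sketch of round robin as just described; the process names and burst times are invented, and I/O blocking is ignored for simplicity:

```python
from collections import deque

def round_robin(burst_times, quantum):
    """Simulate round robin: burst_times maps name -> CPU time still needed.
    Returns the order in which processes finish."""
    ready = deque(burst_times.items())
    finished = []
    while ready:
        name, remaining = ready.popleft()
        if remaining <= quantum:        # finishes within its quantum
            finished.append(name)
        else:                           # quantum expires: go to the back of the queue
            ready.append((name, remaining - quantum))
    return finished

print(round_robin({"A": 6, "B": 2, "C": 4}, quantum=2))  # ['B', 'C', 'A']
```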

Priority scheduling

Round-robin scheduling implicitly assumes that all processes are equally important, which is often not the case. For example, in a university hierarchy, the dean comes first, then the professors, secretaries, and support staff, and finally the students. The need to take such external factors into account leads to priority scheduling.

The basic idea is clear: each process is assigned a priority, and the process with the highest priority runs first.

This does not mean that a high-priority process can run forever, however: the scheduler may decrease the priority of the currently running process at every clock interrupt. If this drops its priority below that of the next highest-priority process, a process switch occurs. Alternatively, each process may be assigned a maximum quantum it is allowed to run; when that quantum is used up, the next highest-priority process gets a chance to run.

It is often convenient to group processes into priority classes, use priority scheduling between the classes, and use round-robin scheduling within each class. A system with four priority classes is shown below.

The scheduling algorithm works as follows: as long as there are runnable processes in priority class 4, each of them is run for one quantum in round-robin fashion, and lower-priority classes are ignored. If class 4 is empty, the class 3 processes are run round robin; if classes 4 and 3 are both empty, class 2 is run round robin, and so on. If priorities are never adjusted, low-priority processes are prone to starvation.
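
A minimal sketch of this scheme might look like the following; the class numbers and process names are made up, and each call simply returns the next process to run for one quantum:

```python
from collections import deque

# One round-robin queue per priority class (4 is the highest class).
queues = {4: deque(), 3: deque(), 2: deque(), 1: deque()}

def pick_next():
    """Return the next process: highest non-empty class, round robin within it."""
    for prio in sorted(queues, reverse=True):
        if queues[prio]:
            process = queues[prio].popleft()
            queues[prio].append(process)   # rotate to the back of its own class
            return process
    return None                            # nothing is runnable

queues[4].extend(["editor", "shell"])
queues[1].append("batch-job")
print([pick_next() for _ in range(4)])     # ['editor', 'shell', 'editor', 'shell']
# "batch-job" starves as long as class 4 stays non-empty.
```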

Shortest process first

In batch systems, shortest job first tends to minimize response time, so it would be nice to use it for interactive processes as well. Interactive processes typically follow the pattern: wait for a command, execute the command, wait for a command, execute the command, and so on. If we treat each command execution as a separate job, we can minimize response time by running the shortest one first. The only problem is figuring out which of the currently runnable processes is the shortest.

One approach is to make estimates based on each process's past behavior and run the process with the shortest estimated running time. Suppose the estimated running time per command for some terminal is T0, and the next run is measured to take T1. The estimate can be updated by taking a weighted sum of these two values, aT0 + (1 - a)T1. By choosing a, we can decide to have the estimation process forget old runs quickly or remember them for a long time. When a is equal to 1/2, you get the following sequence.

As you can see, after three rounds, T0’s share of the new estimate has dropped to 1/8.

This technique of estimating the next value in a series by taking a weighted average of the current measurement and the previous estimate is sometimes called aging. It is applicable to many situations where a prediction must be made based on previous values.
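
Here is a tiny sketch of the update rule with a = 1/2; the initial estimate T0 = 8 and the later measurements are assumed values for illustration:

```python
def aging_estimate(estimate, measured, a=0.5):
    """Blend the previous estimate with the newly measured run time."""
    return a * estimate + (1 - a) * measured

estimate = 8.0                      # assumed initial estimate T0
for measured in [4.0, 4.0, 4.0]:    # assumed measured run times
    estimate = aging_estimate(estimate, measured)
    print(estimate)                 # 6.0, 5.0, 4.5 -- T0's weight has decayed to 1/8
```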

Guaranteed scheduling

A completely different scheduling approach is to make explicit performance guarantees to users. A practical and easily implemented guarantee is that if a user is working with n users logged in, each user will get 1/n of the CPU processing power. Similarly, in a single-user system with n processes running, if all processes are equivalent, each process will get 1/n of CPU time.

Lottery scheduling

Making promises to users and then keeping them is nice, but hard to do. There is, however, an algorithm that gives similarly predictable results with a much simpler implementation: lottery scheduling.

The basic idea is to give processes lottery tickets for various system resources, such as CPU time. Whenever a scheduling decision has to be made, a ticket is drawn at random, and the process holding that ticket gets the resource. Applied to CPU scheduling, the system might hold a lottery 50 times a second, with each winner receiving, say, 20 milliseconds of CPU time as its prize.

You can exchange tickets between processes if you want them to collaborate. For example, if a client process blocks after sending a message to the server process, the client process might give the server all its tickets to increase the chances of the server running again. When the service is complete, it returns the lottery ticket to the client so it can run again. In fact, if there were no clients, the server would not need the lottery at all.

Think of tickets as a buff: a process holding 15% of the tickets has roughly a 15% chance of winning each drawing, and will therefore get about 15% of the CPU in the long run.
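
A minimal sketch of one drawing follows; the process names and ticket counts are invented, and drawing repeatedly shows that wins are roughly proportional to tickets held:

```python
import random

tickets = {"server": 15, "client_a": 3, "client_b": 2}   # hypothetical allocation

def draw_winner():
    """Pick the process owning the randomly drawn ticket (tickets held = weight)."""
    pool = [name for name, n in tickets.items() for _ in range(n)]
    return random.choice(pool)

wins = {name: 0 for name in tickets}
for _ in range(10_000):
    wins[draw_winner()] += 1
print(wins)   # e.g. {'server': ~7500, 'client_a': ~1500, 'client_b': ~1000}
```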

Fair share scheduling

So far, we have assumed that each process is being scheduled, regardless of who owns the process. As a result, if user 1 starts nine processes and user 2 starts one, using rotation or the same priority scheduling algorithm, user 1 will get 90% of the CPU time and user 2 will get 10% of the CPU time.

To prevent this, some systems take into account which user owns a process before scheduling it. In this model, each user is allocated some fraction of the CPU, and the scheduler picks processes in a way that enforces it. Thus if two users have each been promised 50% of the CPU, they will each get that share no matter how many processes they have.

Scheduling in real-time systems

A real-time system is one in which time plays an essential role. Real-time systems can be divided into two categories: hard real-time systems, which have absolute deadlines that must be met, and soft real-time systems, in which missing an occasional deadline is undesirable but tolerable. In both cases, real-time behavior is achieved by dividing the program into a group of processes, each of whose behavior is predictable and known in advance. These processes tend to be short-lived and run extremely quickly. When an external event is detected, the scheduler's job is to schedule the processes in such a way that all deadlines are met.

The events that a real-time system must respond to can be further classified as periodic (occurring at regular intervals) or aperiodic (occurring at unpredictable times). A system may have to respond to multiple periodic event streams, and depending on how much time each event requires for processing, handling all of them may not even be possible. For example, if there are m periodic events, event i occurs with period Pi and requires Ci seconds of CPU time to handle, then the load can only be handled if

C1/P1 + C2/P2 + … + Cm/Pm ≤ 1

A real-time system that meets this condition is said to be schedulable, which means it can actually be implemented. A set of processes that fails this test cannot be scheduled, because the total amount of CPU time they require is more than the CPU can supply.
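
A tiny sketch of this schedulability check follows; the periods and processing times are illustrative numbers, not taken from the article:

```python
def schedulable(events):
    """events: list of (C, P) pairs -- CPU time C needed per event, period P."""
    return sum(c / p for c, p in events) <= 1

# Three streams with periods 100, 200, 500 ms needing 50, 30, 100 ms of CPU each:
print(schedulable([(50, 100), (30, 200), (100, 500)]))              # True  (0.85)
# A fourth stream needing 30 ms every 100 ms pushes the load past 1:
print(schedulable([(50, 100), (30, 200), (100, 500), (30, 100)]))   # False (1.15)
```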

The scheduling algorithms of a real-time system can be static or dynamic. The former make their scheduling decisions before the system starts running; the latter make them at run time. Static scheduling only works when it is known in advance what work has to be done and what deadlines have to be met; dynamic scheduling has no such restrictions.

Scheduling policies and mechanisms

So far, we have implicitly assumed that all the processes in the system belong to different users and compete with each other for the CPU. While this is often true, sometimes one process has many children running under its control. For example, a database management system process may have many children; each child may be working on a different request, or each may perform a different function (request analysis, disk access, and so on). It is entirely possible that the main process knows exactly which of its children are the most important (or most urgent) and which the least. However, none of the scheduling algorithms discussed above accepts any scheduling input from user processes, so the scheduler rarely makes the best choice.

The solution is to separate the scheduling mechanism from the scheduling policy, a long-established principle. This means that the scheduling algorithm is parameterized in some way, but the parameters can be filled in by user processes. Consider the database example again. Suppose the kernel uses a priority scheduling algorithm but provides a system call by which a process can set the priorities of its children. In this way the parent can control in detail how its children are scheduled, even though it does not do the scheduling itself. Here the mechanism lives in the kernel, but the policy is set by a user process. This separation of policy and mechanism is the key idea.

Algorithms in memory management

Many algorithms have also appeared in the memory-management part of the operating system; their ultimate goal is to allocate memory sensibly.

Operating systems have two basic ways of keeping track of memory usage: bitmaps and linked lists (free lists).

When linked lists are used to manage memory, there are several variations of the approach.

When the processes and free areas (holes) are kept on a list sorted by address, several algorithms can be used to allocate memory for a newly created process (or an existing process being swapped in from disk). Assume the memory manager knows how much memory to allocate. The simplest algorithm is first fit: the memory manager scans along the list until it finds a hole that is big enough. Unless the hole is exactly the size requested, it is split into two parts, one for the process and one forming a new, smaller hole. First fit is a fast algorithm because it searches as little of the list as possible.

A minor variation of first fit is next fit. It works the same way as first fit, except that it remembers where it found the last suitable hole, so the next search starts from that point rather than from the beginning of the list, as first fit does. Simulations by Bays (1977) showed that next fit performs slightly worse than first fit.

Another well-known and widely used algorithm is best fit. Best fit searches the whole list and takes the smallest hole that can accommodate the process. Rather than breaking up a big hole that might be needed later, it tries to find a hole that is close to the size actually requested, to best match the request to the available holes. For example, if a block of size 2 is now needed, first fit would allocate the hole at position 5, while best fit would allocate the hole at position 18, as shown below.

So how does best fit perform? Because it must search the entire list every time it is called, best fit is slower than first fit. Somewhat surprisingly, it also wastes more memory than first fit or next fit, because it tends to fill memory with tiny, useless holes, whereas first fit tends to leave larger holes behind.

To get around the problem of best fit splitting memory into many tiny holes, one could consider worst fit: always allocate from the largest available hole (now you can see why best fit produces so many small holes), so that the hole left over after allocation is still big enough to be useful. Simulation has shown that worst fit is not a very good idea either.
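
The following sketch contrasts the three placement strategies on a made-up free list of (start address, size) holes; it only shows which hole each strategy would pick, not the actual splitting:

```python
# Hypothetical free list: (start_address, size) holes sorted by address.
holes = [(0, 5), (8, 2), (14, 6), (24, 3), (30, 10)]

def pick_hole(holes, size, strategy):
    """Return the hole the given placement strategy would choose, or None."""
    fits = [h for h in holes if h[1] >= size]
    if not fits:
        return None
    if strategy == "first":
        return fits[0]                        # first adequate hole by address
    if strategy == "best":
        return min(fits, key=lambda h: h[1])  # smallest adequate hole
    if strategy == "worst":
        return max(fits, key=lambda h: h[1])  # largest hole
    raise ValueError(strategy)

for s in ("first", "best", "worst"):
    print(s, pick_hole(holes, 3, s))   # first (0, 5) | best (24, 3) | worst (30, 10)
```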

All four algorithms can be sped up by maintaining separate lists for processes and for holes; the algorithms then devote all their energy to inspecting holes, not processes. The unavoidable price paid for this faster allocation is extra complexity and slower deallocation, since a freed segment has to be removed from the process list and inserted into the hole list.

If separate lists are kept for processes and holes, the hole list can be sorted by size to make best fit faster. When best fit searches a hole list sorted from smallest to largest, as soon as it finds a hole that fits, it knows that hole is the smallest one that will do the job, hence the best fit; no further searching is needed. With a hole list sorted by size, first fit is exactly as fast as best fit, and next fit becomes pointless.

Another allocation algorithm is quick fit, which maintains separate lists for holes of commonly requested sizes. For example, it might have a table with n entries, in which the first entry points to the head of a list of 4 KB holes, the second to a list of 8 KB holes, the third to a list of 12 KB holes, and so on. A hole of, say, 21 KB could be put either on the 20 KB list or on a special list of odd-sized holes.

Finding a hole of the required size with quick fit is extremely fast, but it shares a disadvantage with all schemes that sort holes by size: when a process terminates or is swapped out, finding its neighbors to see whether a merge is possible is very time-consuming. If merging is not done, memory quickly fragments into a large number of small holes that no process can use.

Page replacement algorithm

Page replacement has many algorithms; let's look at them.

When a page fault occurs, the operating system selects a page to evict in order to make room for the incoming page. If the page to be evicted has been modified while in memory, it must be rewritten to disk to bring the disk copy up to date. If the page has not been modified, the disk copy is already current and no rewrite is needed; the page being read in simply overwrites the page being evicted.

Although a page could be chosen at random at each page fault, system performance is much better if a rarely used page is chosen. If a heavily used page is evicted, it will probably have to be brought back in shortly afterward, causing extra overhead. Much theoretical and practical work has been done on page replacement algorithms.

Let’s take a look at what page replacement algorithms are available.

Optimal page replacement algorithm

The optimal page replacement algorithm is easy to describe but impossible to implement in practice. It works like this: at the moment a page fault occurs, some set of pages is in memory. One of these pages will be referenced on the very next instruction (the page containing that instruction); other pages may not be referenced until 10, 100, or 1,000 instructions later. Each page can be labeled with the number of instructions that will be executed before that page is first referenced.

The optimal algorithm says that the page with the highest label should be removed. If one page will not be used for 8 million instructions and another will not be used for 6 million, evicting the former pushes the page fault that will bring it back as far into the future as possible. Computers, like people, put off unpleasant events for as long as they can.

The biggest problem with this algorithm is that it cannot be implemented: at the time of the page fault, the operating system has no way of knowing when each page will be referenced next. As a result, it is not used in practice at all.

The not recently used page replacement algorithm

To allow the operating system to collect page-usage information, most computers with virtual memory have two status bits, R and M, associated with each page. R is set whenever the page is referenced (read or written); M is set whenever the page is written to (that is, modified). These bits are contained in each page table entry, as shown below.

Because these bits are updated with each access, it is important that the hardware set them. Once a bit is set to 1, it remains 1 until the next time the operating system changes the bit.

If the hardware does not have these bits, they can be simulated using the operating system's page fault and clock interrupt mechanisms. When a process is started, all of its pages are marked as not in memory. Any reference to any page then causes a page fault, at which point the operating system can set the R bit (in its internal tables), change the page table entry to point to the correct page in READ ONLY mode, and restart the instruction that caused the fault. If the page is subsequently modified, another page fault occurs, allowing the operating system to set the M bit and change the page's mode to READ/WRITE.

A simple page replacement algorithm can be built from the R and M bits: when a process is started, the operating system sets both bits of all its pages to 0. Periodically (on each clock interrupt), the R bits are cleared, in order to distinguish pages that have not been referenced recently from those that have.

When a page fault occurs, the operating system inspects all the pages and divides them into four categories based on the current values of their R and M bits:

  • Class 0: not referenced, not modified
  • Class 1: not referenced, modified
  • Class 2: referenced, not modified
  • Class 3: referenced, modified

Although class 1 pages may seem impossible at first glance, they occur when a class 3 page has its R bit cleared by a clock interrupt. Clock interrupts do not clear the M bit, because that information is needed to know whether the page has to be written back to disk. Clearing R but not M produces a class 1 page.

The NRU (Not Recently Used) algorithm removes a page at random from the lowest-numbered non-empty class. Implicit in this algorithm is the idea that it is better to evict a modified page that has not been referenced in at least one clock tick (typically about 20 ms) than a clean page that is in heavy use. The main attraction of NRU is that it is easy to understand and reasonably efficient to implement.
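
As a small sketch (page names and bit values are made up), classifying pages and drawing a victim from the lowest non-empty class might look like this:

```python
import random

def nru_victim(pages):
    """pages: list of (name, R, M) with R/M as 0 or 1. Pick a page to evict."""
    page_class = {name: 2 * r + m for name, r, m in pages}   # class 0..3
    lowest = min(page_class.values())
    candidates = [name for name, cls in page_class.items() if cls == lowest]
    return random.choice(candidates)   # random page from the lowest non-empty class

pages = [("A", 1, 1), ("B", 1, 0), ("C", 0, 1), ("D", 0, 1)]
print(nru_victim(pages))   # 'C' or 'D': class 1 (not referenced, modified)
```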

FIFO page replacement algorithm

Another low-overhead approach is the FIFO (first-in, first-out) algorithm, which applies the familiar data-structure idea to page replacement. The operating system maintains a list of all pages currently in memory, with the earliest arrival at the head and the most recent arrival at the tail. On a page fault, the page at the head is removed and the new page is added to the tail of the list.

FIFO is probably the simplest page replacement algorithm. The operating system keeps all the pages in memory on a linked list. Let's walk through an example. (I was a little confused the first time I looked at this algorithm; I understand it now, though I still don't love it.)

  • Initially there are no pages in memory, so the first access checks whether page 1 is on the list. It is not, so the result is a MISS and page 1 enters the list; the FIFO direction of the list is shown in the figure.
  • Likewise, the second access checks whether page 2 is on the list. It is not, so page 2 enters the list with status MISS, and so on.
  • Look at the fourth access: the list is now 1 2 3 and the page being checked is 2. It is already on the list, so the status is HIT and no enqueue or dequeue takes place. The fifth access is the same.
  • Now look at the sixth access, where the list is still 1 2 3 because nothing was enqueued in the meantime. Page 5 is not found on the list, so page 5 enters the list and the oldest page, page 1, leaves it. After this the list is 2, 3, 5 (see the sketch after this list).
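
Here is a minimal sketch that reproduces this walkthrough, assuming three page frames and the reference string 1, 2, 3, 2, 2, 5 (the exact page touched on the fifth access is not spelled out above, so that entry is a guess):

```python
from collections import deque

def fifo(reference_string, num_frames):
    """Trace FIFO page replacement: print HIT/MISS and the frame list per access."""
    frames = deque()                      # head = oldest page, tail = newest
    for page in reference_string:
        if page in frames:
            status = "HIT"
        else:
            status = "MISS"
            if len(frames) == num_frames:
                frames.popleft()          # evict the oldest page
            frames.append(page)
        print(page, status, list(frames))

fifo([1, 2, 3, 2, 2, 5], num_frames=3)
# The last line prints: 5 MISS [2, 3, 5]  (page 1, the oldest, is evicted)
```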

Second chance page replacement algorithm

The FIFO algorithm we just looked at has a flaw: it never checks whether the page it evicts is in use, so it can easily throw out a heavily used page. To avoid this, a simple modification is made: inspect the R bit of the oldest page. If it is 0, the page is both old and unused, so it is replaced immediately. If the R bit is 1, the bit is cleared and the page is put at the end of the list, with its load time updated as though it had just arrived; the search then continues.

This algorithm is called the second chance algorithm, and as we see below, pages A through H remain in the linked list, sorted by the time they arrived in memory.

A) pages arranged in first-in, first-out order; B) the page list when a page fault occurs at time 20 and A's R bit has been set.

Suppose a page fault occurs at time 20. The oldest page is A, which arrived at time 0. If A's R bit is 0, it is evicted from memory, either written back to disk (if it has been modified) or simply abandoned (if it has not). On the other hand, if its R bit is set, A is placed at the end of the list, its load time is reset to the current time (20), and its R bit is cleared. The search for a suitable page then continues with B.

What second chance is looking for is an old page that has not been referenced in the most recent clock interval. If every page has been referenced, second chance degenerates into pure FIFO. Specifically, suppose all the pages in Figure A have their R bits set. One by one, the operating system moves the pages to the end of the list, clearing the R bit each time it appends a page. Eventually it comes back around to page A, whose R bit is now cleared, so A is evicted and the algorithm terminates normally.
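
A minimal sketch of the eviction step, with made-up page names and R bits (oldest page first in the queue):

```python
from collections import deque

def second_chance_victim(pages):
    """pages: deque of [name, r_bit], oldest first. Returns the evicted page name."""
    while True:
        name, r_bit = pages.popleft()       # look at the oldest page
        if r_bit == 0:
            return name                     # old and unreferenced: evict it
        pages.append([name, 0])             # referenced: clear R, give a second chance

pages = deque([["A", 1], ["B", 0], ["C", 1]])
print(second_chance_victim(pages))   # 'B': A gets a second chance, B's R bit is 0
print(list(pages))                   # [['C', 1], ['A', 0]]
```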

Clock page replacement algorithm

Although second chance is a reasonable algorithm, it is needlessly inefficient because it is constantly moving pages around on its list. A better approach is to keep all the pages on a circular list, like a clock face, with a hand pointing to the oldest page, as shown in the figure below.

When a page fault occurs, the page the hand points to is inspected first. If its R bit is 0, that page is evicted, the new page is inserted into its place, and the hand is advanced one position. If the R bit is 1, it is cleared and the hand is advanced one position. This is repeated until a page with R = 0 is found. Knowing how it works, you can see why it is called the clock algorithm.
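
Here is a small sketch of that hand movement; the page names, R bits, and starting hand position are invented:

```python
def clock_evict(pages, hand):
    """pages: list of [name, r_bit] arranged in a circle; hand: current index.
    Returns (index_to_replace, new_hand_position)."""
    while True:
        name, r_bit = pages[hand]
        if r_bit == 0:
            return hand, (hand + 1) % len(pages)   # evict here, advance the hand
        pages[hand][1] = 0                         # clear R and give the page a pass
        hand = (hand + 1) % len(pages)

pages = [["A", 1], ["B", 1], ["C", 0], ["D", 1]]
index, hand = clock_evict(pages, hand=0)
print(pages[index][0], hand)   # C 3  (A and B had their R bits cleared first)
```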

The least recently used page replacement algorithm

The idea behind this algorithm is the following observation: pages that have been heavily used in the last few instructions are likely to be used again in the next few, while pages that have not been used for a long time will probably remain unused for some time to come. This suggests a realizable algorithm: when a page fault occurs, evict the page that has gone unused for the longest time. This strategy is called LRU (Least Recently Used) page replacement.

Although LRU is theoretically realizable, it is not cheap. To fully implement LRU, a linked list of all pages in memory must be maintained, with the most recently used page at the head and the least recently used page at the tail. The hard part is that the list must be updated on every memory reference. Finding a page in the list, deleting it, and moving it to the head is a time-consuming operation, even in hardware.

However, there are ways to implement LRU with special hardware. Consider the simplest one first: it requires a 64-bit hardware counter that is automatically incremented after each instruction, and each page table entry must have a field large enough to hold the counter's value. After each memory reference, the current counter value is stored in the page table entry of the page just referenced. When a page fault occurs, the operating system examines the counter values in all the page table entries and chooses the page with the lowest value; that is the least recently used page.

Simulate LRU with software

Although the LRU algorithm above is realizable in principle, few machines have that special hardware, so we now consider implementing something like LRU in software. One possibility is the NFU (Not Frequently Used) algorithm. It associates a software counter with each page, initially 0. At each clock interrupt, the operating system scans all the pages in memory and adds each page's R bit (0 or 1) to its counter. The counters roughly track how often each page is referenced. When a page fault occurs, the page with the smallest counter value is replaced.

The main problem with NFU is that it never forgets anything. For example, in a multipass compiler, pages that were heavily used during the first pass may still have high counts in later passes. In fact, if the first pass happens to have the longest execution time of all the passes, pages used in later passes will always have lower counts than the first-pass pages. The result is that the operating system evicts useful pages instead of pages that are no longer in use.

Fortunately, a small modification to NFU makes it able to simulate LRU quite well. The modification has two parts:

  • First, each counter is shifted right by one bit before the R bit is added in.
  • Second, the R bit is added to the leftmost bit rather than the rightmost bit.

The modified algorithm is called the aging algorithm, and the figure below illustrates how the aging algorithm works.

Suppose that during the first clock tick the R bits of pages 0 through 5 are 1, 0, 1, 0, 1, 1 (that is, page 0 is 1, page 1 is 0, page 2 is 1, and so on). In other words, between tick 0 and tick 1, pages 0, 2, 4, and 5 were referenced, setting their R bits to 1 while the others stayed 0. After the six corresponding counters have been shifted right, the R bits are added in at the left, as in (a) in the figure above. The remaining four columns show the six counters after the next four clock ticks.

When a page fault occurs, the page whose counter is lowest is removed. Clearly, a page that has not been referenced for, say, four clock ticks will have four leading zeros in its counter, and thus a lower value than a page that has not been referenced for only three clock ticks.
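
A minimal sketch of the counter update follows, using 8-bit counters and the R bits from the first tick of the example above; further ticks would simply call tick() again with new R bits:

```python
# Aging: 8-bit counters, shifted right each tick with the R bit added at the left.
counters = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}

def tick(r_bits):
    """r_bits: dict page -> 0/1 for this clock interval."""
    for page in counters:
        counters[page] = (counters[page] >> 1) | (r_bits[page] << 7)

# R bits for the first tick from the example: pages 0, 2, 4, 5 were referenced.
tick({0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 1})
print({p: format(c, "08b") for p, c in counters.items()})
# {0: '10000000', 1: '00000000', 2: '10000000', 3: '00000000', 4: '10000000', 5: '10000000'}

# On a page fault, the page with the smallest counter is the victim:
print(min(counters, key=counters.get))   # 1 (pages 1 and 3 are tied at 0; min picks the first)
```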

There are two important differences between this algorithm and LRU. Consider pages 3 and 5 in (e), the last column of the figure above.

Neither has been referenced during the last two clock ticks, but both were referenced in the tick before that. According to LRU, if one of them must be replaced, we should choose the one referenced longer ago. The problem is that we do not know which of the two was referenced later between tick 1 and tick 2. By recording only one bit per tick, we have lost the ability to tell which page was referenced first and which last within an interval. All we can do is evict page 3, because page 5 was also referenced two ticks earlier, in interval 0-1, while page 3 was not.

The second difference between LRU and aging is that the aging counters have a finite number of bits (8 in this example), which limits how far back the history reaches. If two pages both have a counter of 0, all we can do is pick one of them arbitrarily. In reality, one may have been last referenced 9 clock ticks ago and the other 1,000 ticks ago, but we have no way of knowing that. In practice, if a clock tick is around 20 ms, 8 bits are generally enough; a page that has not been referenced in 160 ms probably is not very important.

Working set page replacement algorithm

In the simplest paging systems, no pages are in memory when a process is started. As soon as the CPU tries to fetch the first instruction, it gets a page fault, causing the operating system to bring in the page containing that instruction. Other page faults, for global variables and the stack, usually follow quickly. After a while, the process has most of the pages it needs in memory and runs with relatively few page faults. This strategy is called demand paging, because pages are loaded on demand rather than in advance.

A program that systematically reads all the pages of a large address space would cause so many page faults that there would not be enough memory to hold them all. Fortunately, most processes do not work this way; they exhibit a locality of reference, meaning that during any phase of execution the program references only a relatively small fraction of its pages.

The set of pages that a process is currently using is called its working set. If the entire working set is in memory, the process will not generate many page faults until it moves on to its next execution phase (for example, the compiler's next pass). If the available memory is too small to hold the whole working set, the process will run with a large number of page faults and execute slowly, since executing an instruction usually takes only a few nanoseconds while reading a page from disk typically takes around ten milliseconds. At a rate of one or two instructions per 10 ms, a program would take ages to finish. A program that causes page faults every few instructions is said to be thrashing.

In multiprogramming systems, processes are often moved to disk (that is, all their pages are removed from memory) to give other processes a chance at the CPU. The question is what happens when such a process is brought back in. Technically, nothing needs to be done: the process will simply keep generating page faults until its working set is back in memory. But taking 20, 100, or even 1,000 page faults every time a process is loaded is obviously slow, and considerable CPU time is wasted, since processing a page fault costs the operating system a few milliseconds of CPU time.

As a result, many paging systems try to keep track of each process's working set and make sure it is in memory before letting the process run. This approach is called the working set model, and it is designed to reduce the page fault rate. Loading the working set pages before letting a process run is called prepaging. Note that the working set changes over time.

It has long been known that most programs do not reference their address space uniformly; the references tend to cluster on a small number of pages. A memory reference may fetch an instruction, fetch data, or store data. At any time t, there exists a set containing the pages referenced by the last k memory references. This set, w(k, t), is the working set. Because the pages referenced by the last k = 1 references are certainly among those referenced by the last k > 1 references, w(k, t) is a monotonically non-decreasing function of k. As k grows, w(k, t) does not become infinitely large, because a program cannot reference more pages than its address space contains.

In fact, most programs randomly access only a small set of pages, and that set changes slowly over time, which is why the curve of w(k, t) rises quickly for small k and then levels off for large k. To implement the working set model, the operating system must keep track of which pages are in the working set. One way uses execution time: the total amount of CPU time a process has actually used since it started is often called its current virtual time, and the working set of a process can then be defined as the set of pages it has referenced during the last t seconds of virtual time.

Here is an outline of the working set page replacement algorithm. The basic idea is to find a page that is not in the working set and evict it. Below is part of a machine's page table.

Because only pages that are in memory can be candidates for eviction, pages that are not in memory are ignored by this algorithm. Each entry contains at least two pieces of information: the approximate time the page was last used and the R (referenced) bit. The blank rectangles symbolize the other fields not needed by this algorithm, such as the page frame number, the protection bits, and the M (modified) bit.

The algorithm works as follows, assuming the hardware sets the R and M bits. A periodic clock interrupt also causes software to clear the R (referenced) bits on every clock tick. On each page fault, the page table is scanned to look for a suitable page to evict.

As each entry is processed, its R bit is examined. If the R bit is 1, the current virtual time is written into the entry's time-of-last-use field, indicating that the page was in use when the fault occurred. Since the page has been referenced during the current clock tick, it is clearly in the working set and should not be removed (assuming t spans multiple clock ticks).

If the R bit is 0, the page has not been referenced during the current clock tick and may be a candidate for removal. To see whether it should be removed, its age (the current virtual time minus its time of last use) is computed and compared with t. If the age is greater than t, the page is no longer in the working set, and the new page replaces it. The scan then continues, updating the remaining entries.

However, if the R bit is 0 but the age is less than or equal to t, the page is still in the working set. It is temporarily spared, but the page with the greatest age (that is, the smallest time of last use) is remembered. If the entire table is scanned without finding a candidate to evict, all pages are in the working set. In that case, if one or more pages with R = 0 were found, the one with the greatest age is evicted. In the worst case, all pages have been referenced during the current clock tick (all have R = 1), so a page is chosen at random for eviction, preferably a clean (unmodified) one if any exists.

Working set clock page replacement algorithm

The basic working set algorithm is cumbersome, because the entire page table has to be scanned at each page fault to locate a victim. An improved algorithm, based on the clock algorithm but also using working set information, is called WSClock (Working Set Clock). Because of its simple implementation and good performance, it is widely used in practice.

As with the clock algorithm, the required data structure is a circular list with page frames as its elements, as shown below.

Initially, this list is empty. When the first page is loaded, it is added to the list; as more pages are added, they form a ring. Each entry contains the time-of-last-use field from the basic working set algorithm, as well as the R bit (shown) and the M bit (not shown).

As with the clock algorithm, at each page fault the page pointed to by the hand is examined first. If its R bit is 1, the page has been used during the current tick, so it is not a good candidate for eviction. Its R bit is set to 0, the hand is advanced to the next page, and the algorithm is repeated for that page. The state after this sequence of events is shown in Figure B.

Now consider what happens when the page pointed to has R = 0; see Figure C. If its age is greater than t and the page is clean, it is not in the working set and a valid copy exists on disk, so the frame is simply claimed and the new page is placed in it, as shown in Figure D. On the other hand, if the page has been modified, the frame cannot be claimed immediately, because there is no valid copy of the page on disk. To avoid a process switch caused by waiting for the write, the write to disk is scheduled, the hand advances, and the algorithm continues with the next page. After all, there may be an old, clean page further along that can be used at once.

In principle, all pages could be scheduled for disk I/O during one sweep around the clock. To reduce disk traffic, a limit of n write-backs is set. Once this limit is reached, no new writes are scheduled.

So what happens when the hand goes all the way around and comes back to its starting point? There are two cases:

  • At least one write has been scheduled
  • No writes have been scheduled

In the first case, the hand simply keeps moving, looking for a clean page. Since one or more writes have been scheduled, eventually some write will complete and its page will be marked as clean. The first clean page encountered is the one replaced; it is not necessarily the first page whose write was scheduled, because the disk driver may reorder writes to optimize performance.

In the second case, all pages are in the working set; otherwise at least one write would have been scheduled. Lacking additional information, the simplest thing to do is claim any clean page and use it; the location of a clean page can be noted during the sweep. If no clean page exists, the current page is chosen as the victim and written back to disk.

Page replacement algorithm summary

We have now looked at a variety of page replacement algorithms; the table below gives a brief summary.

| Algorithm | Comment |
| --- | --- |
| Optimal | Not implementable, but useful as a benchmark |
| NRU (Not Recently Used) | A crude approximation of LRU |
| FIFO (First-In, First-Out) | May throw out important pages |
| Second chance | A big improvement over FIFO |
| Clock | Realistic, used in practice |
| LRU (Least Recently Used) | Good, but hard to implement exactly |
| NFU (Not Frequently Used) | A fairly crude approximation of LRU |
| Aging | An efficient algorithm that approximates LRU well |
| Working set | Somewhat expensive to implement |
| WSClock | A good, efficient algorithm |

  • The optimal algorithm evicts the page that will be referenced furthest in the future. Unfortunately, there is no way to know which page that will be, so this algorithm cannot actually be used. It is, however, useful as a benchmark against which other algorithms can be measured.

  • The NRU algorithm divides pages into four classes according to the state of their R and M bits and picks a random page from the lowest-numbered non-empty class. NRU is easy to implement, but its performance is not very good; better algorithms exist.

  • The FIFO tracks the order in which pages are loaded into memory and places them in a linked list. It is possible to delete the oldest pages that are still in use, so this algorithm is not a good choice either.

  • The second chance algorithm is a modification to the FIFO that checks to see if a page is still in use before deleting it. If the page is in use, it is reserved. This improvement greatly improves performance.

  • The clock algorithm is another implementation of the second chance algorithm. The clock algorithm performs about the same as the second chance algorithm, but takes less time to execute the algorithm.

  • LRU is a very good algorithm, but it is hard to implement without special hardware (such as the 64-bit counter described earlier). If that hardware is not available, exact LRU cannot be used.

  • NFU algorithm is similar to LRU algorithm, but its performance is not very good.

  • The aging algorithm is a much closer approximation to LRU and can be implemented efficiently, so it is a good choice.

  • The last two algorithms are based on the working set. The working set algorithm gives reasonable performance, but it is somewhat expensive to implement. WSClock is a variant that not only gives good performance but can also be implemented efficiently.

In short, the two best algorithms are aging and WSClock, which are based on LRU and the working set respectively. Both give good paging performance and can be implemented efficiently. Other good algorithms exist, but in practice these two are probably the most important.

Algorithms in the file system

File systems also use algorithms, for example during backup. File backups fall into two categories: logical dumps and physical dumps.

Physical dump and logical dump

The main advantage of a physical dump is that it is simple and extremely fast (it runs at basically the speed of the disk). The disadvantages are that it cannot skip selected directories, cannot make incremental dumps, and cannot restore individual files on request. For these reasons, most installations do not rely on physical dumps but use logical dumps instead.

A logical dump starts at one or more specified directories and recursively dumps the files and directories that have changed since a given base date. Thus, in a logical dump, the dump disk contains a carefully identified set of directories and files, which makes it easy to restore a specific file or directory on request.

Since logical dumps are the most common approach, let’s look at a general algorithm for logical dumps. This algorithm is widely used on UNIX systems, as shown in the figure below

The figure shows the file system to be dumped, where boxes represent directories and circles represent files. The yellow items have been modified since the last dump. Each directory and file is labeled with its inode number.

This algorithm dumps all directories (including unmodified ones) that lie on the path to a modified file or directory, for two reasons. The first is to make it possible to restore the dumped files into a fresh file system on a different computer; in this way, the dump and restore programs can be used to transfer entire file systems between two computers. The second reason is to make it possible to do incremental restores of individual files.

The logical dump algorithm maintains a bitmap indexed by inode number, with several bits per inode. As the algorithm proceeds, bits in the bitmap are set and cleared. The algorithm runs in four phases. Phase 1 begins at the starting directory (the root in this example) and examines all of its entries. For each modified file, its inode is marked in the bitmap. Each directory is also marked (whether modified or not) and then recursively inspected.

At the end of phase 1, all modified files and all directories are marked in the bitmap, as shown below

Conceptually, the second phase recursively walks the tree again, unmarking any directories that contain no modified files or directories anywhere beneath them. This phase leaves the bitmap as shown below.

Note that the directories with inode numbers 10, 11, 14, 27, 29, and 30 are now unmarked, because nothing beneath them has been modified; they will not be dumped. By contrast, directory inodes 5 and 6 will be dumped even though they themselves have not been modified, because that information is needed to restore today's changes on a fresh machine. For efficiency, the two tree walks of phases 1 and 2 can be combined into one.

At this point it is known which directories and files must be dumped (everything marked in Figure B above). Phase 3 scans the inodes in numerical order and dumps all the directories marked for dumping, as shown below.

For recovery, each dumped directory is prefixed with the attributes of the directory (owner, time).

Finally, in phase 4, the files marked in the figure above are dumped, again prefixed by their attributes. That completes the dump.
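
The following is a much simplified sketch of phases 1 and 2 on a toy in-memory tree; the inode numbers and layout are invented and do not match the figures:

```python
# Toy file system: inode -> (kind, modified, children).
fs = {
    1: ("dir",  False, [2, 3]),
    2: ("dir",  False, [4]),
    3: ("dir",  False, [5]),
    4: ("file", True,  []),
    5: ("file", False, []),
}

marked = set()

def phase1(inode):
    """Mark every directory and every modified file, recursing through directories."""
    kind, modified, children = fs[inode]
    if kind == "dir":
        marked.add(inode)
        for child in children:
            phase1(child)
    elif modified:
        marked.add(inode)

def phase2(inode):
    """Unmark directories with nothing marked beneath them; report whether the subtree has marks."""
    kind, modified, children = fs[inode]
    if kind != "dir":
        return inode in marked
    if not any([phase2(c) for c in children]):   # list first so every child is visited
        marked.discard(inode)
        return False
    return True

phase1(1)
phase2(1)
# Phases 3 and 4 would dump marked directories, then marked files, in inode order:
print(sorted(i for i in marked if fs[i][0] == "dir"))    # [1, 2]  (3 was unmarked)
print(sorted(i for i in marked if fs[i][0] == "file"))   # [4]
```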

Restoring a file system from the dump disk is simple. To start, an empty file system is created on the disk. Then the most recent full dump is restored. Since the directories come first on the tape, they are restored first, giving the skeleton of the file system; then the files themselves are restored. The full dump is followed by the first incremental dump, then the next one, and so on.

Although logical dumping is straightforward, there are a few tricky issues. First, since the free block list is not a file, it has to be rebuilt from scratch after all the dumped files have been restored.

Another issue is links. If a file is linked from two or more directories, it is important that the file be restored only once and that all the directories that are supposed to point to it do so.

Another problem is that UNIX files may contain holes. It is legal to open a file, write a few bytes, seek to an offset some distance away, and write a few more bytes. The blocks in between do not belong to the file and should not be dumped or restored.

Finally, special files, named pipes, and similar files should not be dumped regardless of which directory they belong to.

Algorithms in I/O

Many algorithms have also appeared in I/O disk scheduling; both seek positioning and the rotation of the disk arm affect them. Let's take a look.

In general, the time required to read or write a disk block is determined by the following factors:

  • Seek time – the time needed to move the disk arm to the proper cylinder
  • Rotational delay – the time needed for the proper sector to rotate under the head
  • Actual data read or write (transfer) time

Together these three make up the disk access time. For most disks, the seek time has the greatest impact on the total, so effectively reducing the seek time can substantially improve disk performance.

If the disk driver accepts requests one at a time and carries them out in arrival order, that is, first-come, first-served (FCFS), little can be done to optimize the seek time. Because requests are processed strictly in order, the arm may have to wait a full rotation to read one sector while requests on other cylinders could have been served immediately; every request simply waits its turn.

Normally, while the disk is seeking on behalf of one request, other processes generate additional disk requests. The disk driver keeps a table indexed by cylinder number, with the pending requests for each cylinder chained together in a linked list headed by the corresponding table entry.

An alternative to first-come, first-served is the shortest seek first (SSF) algorithm, described below.

Suppose that while the arm is at cylinder 6, requests arrive for cylinders 11, 2, 4, 14, 8, 15, and 3. If they are handled first-come, first-served, the arm moves as shown in the figure below.

We can calculate the number of cylinders the arm crosses: 5 + 9 + 2 + 10 + 6 + 7 + 12 = 51. If shortest seek first is used instead, we can again count the cylinders crossed.

The arm now crosses 4 + 1 + 1 + 4 + 3 + 3 + 1 = 17 cylinders, exactly one third of the 51 needed by first-come, first-served.
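
A small greedy sketch of shortest seek first reproduces this total; depending on how ties are broken, the order of visits may differ from the figure, but the total arm motion for this request set comes out the same:

```python
def ssf_total_seek(start, requests):
    """Greedy shortest seek first: always serve the closest pending cylinder next."""
    position, total, pending = start, 0, list(requests)
    while pending:
        nearest = min(pending, key=lambda c: abs(c - position))
        total += abs(nearest - position)
        position = nearest
        pending.remove(nearest)
    return total

print(ssf_total_seek(6, [11, 2, 4, 14, 8, 15, 3]))   # 17 cylinders in total
```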

However, shortest seek first is not perfect. It has a fairness problem: with a heavily loaded disk, the arm tends to stay near the middle of the disk, and requests far from the middle may be starved.

A useful model here is the elevator in everyday life, which uses the elevator algorithm to reconcile the conflicting goals of efficiency and fairness. An elevator generally keeps moving in one direction until there are no more requests in that direction, and only then changes direction.

The elevator algorithm maintains a single bit, the current direction bit: UP or DOWN. When a request completes, the disk or elevator driver checks the bit. If it is UP, the arm or cabin moves to the next highest pending request; if there are no pending requests at higher positions, the direction is reversed. When the direction bit is DOWN and there is a pending request at a lower cylinder, the arm moves there. If no requests are pending at all, it simply stops and waits.

Let’s take an example to describe the elevator algorithm. For example, each cylinder is served in the order of 4,7,10,14,9,6,3,1. Then its flow chart is as follows

So the number of cylinders the elevator algorithm needs to cross here is 3 + 3 + 4 + 5 + 3 + 3 + 1 = 22.

In terms of total arm motion, the elevator algorithm is generally slightly worse than SSF, but it avoids starving distant requests.

A further optimization is possible with some disk controllers that provide a way for the software to inspect the current sector number under the head. If two or more requests for the same cylinder are pending, the driver can issue a request for the sector that will pass under the head next.

Note that when a cylinder has multiple tracks, consecutive requests may be for different tracks. This choice costs nothing, since selecting a head involves neither arm motion nor rotational delay.

Seek time and rotational latency so dominate disk performance that reading only one or two sectors at a time is very inefficient. For this reason, many disk controllers always read multiple sectors and cache them, even when only one is requested. Typically a request for one sector causes that sector and much or all of the rest of its track to be read, depending on how much space is available in the controller's cache.

The disk controller's cache is somewhat different from the operating system's cache: the controller's cache holds blocks that have not actually been requested but happened to be convenient to read, whereas the operating system's cache holds blocks that were explicitly read and that the operating system expects to be needed again in the near future.

When several drives are attached to the same controller, the operating system should maintain a separate pending-request table for each drive. Whenever a drive is idle, a seek should be issued to move its arm to the cylinder it will need next. If no arm is in the correct position when the current transfer finishes, the driver issues a new seek on the drive that has just completed its transfer and waits until the next interrupt to see which arm reaches its destination first.

Algorithms in deadlocks

One strategy for dealing with deadlocks is simply to ignore them (surprise!). This approach is known as the ostrich algorithm.

The simplest approach is the ostrich algorithm: stick your head in the sand and pretend the problem is not happening. People react to this strategy differently. Mathematicians find deadlocks unacceptable and insist they must be prevented by effective strategies. Engineers ask how often the problem occurs, how often the system crashes for other reasons, and how serious a deadlock actually is. If deadlocks occur rarely while the system crashes far more often from hardware failures, compiler errors, and other operating system bugs, most engineers will not go out of their way to fix deadlocks.

Several algorithms have appeared in deadlock detection

Deadlock detection for multiple resources of each type

When multiple identical resources exist, a different method is needed to detect deadlocks. Deadlocks among n processes P1 through Pn can be detected by constructing matrices.

We now present a matrix-based algorithm for detecting deadlocks among n processes P1 through Pn. Let the number of resource classes be m, with E1 resources of class 1, E2 of class 2, and generally Ei of class i (1 ≤ i ≤ m). E is the existing resource vector, giving the total number of instances of each resource class in existence.

We also need two matrices: C, the current allocation matrix, and R, the request matrix. The i-th row of C tells how many instances of each resource class Pi currently holds; thus Cij is the number of instances of resource j held by Pi. Similarly, Rij is the number of instances of resource j that Pi still wants.

An important invariant holds for every resource class j: the number of allocated instances of j plus the number of available instances of j equals the total number of instances of that class.

The deadlock detection algorithm is based on comparing vectors. Each process is initially unmarked. As the algorithm progresses, processes are marked, indicating that they are able to run to completion and are therefore not deadlocked. When the algorithm terminates, any unmarked processes are known to be deadlocked.
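
Here is a minimal sketch of this vector-comparison detection; the resource vector and matrices below are invented for illustration:

```python
def detect_deadlock(E, C, R):
    """E: existing resource vector; C: current allocation matrix; R: request matrix.
    Returns the list of process indices that are deadlocked."""
    n, m = len(C), len(E)
    A = [E[j] - sum(C[i][j] for i in range(n)) for j in range(m)]   # available vector
    marked = [False] * n
    progress = True
    while progress:
        progress = False
        for i in range(n):
            # An unmarked process whose outstanding requests can all be satisfied
            # can run to completion and then return everything it holds.
            if not marked[i] and all(R[i][j] <= A[j] for j in range(m)):
                A = [A[j] + C[i][j] for j in range(m)]
                marked[i] = True
                progress = True
    return [i for i in range(n) if not marked[i]]   # unmarked processes are deadlocked

# Tiny made-up example: 2 resource classes, 3 processes.
E = [4, 2]
C = [[2, 0], [1, 1], [1, 1]]            # what each process currently holds
R = [[0, 1], [2, 0], [0, 0]]            # what each process still wants
print(detect_deadlock(E, C, R))          # []  -- no deadlock: P2 finishes, then P0, then P1
```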

Above we discussed how to detect deadlocks. Now that you know how to detect them, when should you do it? In general, there are two options:

  • Check every time a resource request is made. This detects deadlocks as early as possible, but may consume a lot of CPU time.
  • Check every k minutes, or only when CPU utilization has dropped below some threshold. The CPU criterion makes sense because if enough processes are deadlocked, there are few runnable processes left and the CPU will often be idle.

There are also algorithms for deadlock avoidance.

Banker’s algorithm

The banker's algorithm is a scheduling algorithm proposed by Dijkstra in 1965; it is a deadlock avoidance algorithm. It is modeled on a small-town banker who has promised certain lines of credit to the town's customers. All the algorithm does is decide whether granting a request would lead to an unsafe state. If so, the request is deferred; if the system remains safe after granting the request, it is granted.

For example, in the following scenario the banker has extended a total of 15 units of credit to the town's residents, with each unit representing $1K, as shown below.

The residents like to do business, so loans are involved, and each person has a different maximum credit line. At one point, the loans of A, B, C, and D are as shown below.

The total lent out above is 13 units, which is already close to 15. The banker can afford to satisfy further requests only from A and C, while delaying B and D; A and C can then finish and repay their loans, freeing credit to satisfy the other residents' requests. This is a safe state.

If the total amount of each person’s requests exceeds or approaches 15, you will be in an unsafe state, as shown below

In this way, each person can borrow at least two units, and if one of them initiates a maximum loan request, the system will be in a deadlock state.

One caveat: an unsafe state does not necessarily lead to deadlock, since a customer may never actually need its full credit line, but the banker cannot count on that.

The banker's algorithm considers each request as it occurs and checks whether granting it would lead to an unsafe state. If not, the request is granted; if it would, the request is postponed.
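
The following is a minimal single-resource sketch of that check; the article's actual loan figures are in the missing tables, so the credit lines and current loans below are invented for illustration:

```python
def is_safe(available, loaned, maximum):
    """Single-resource banker's check: can every customer, in some order, be
    granted up to their maximum and then repay what they hold?"""
    finished = set()
    while len(finished) < len(loaned):
        for name in loaned:
            still_needed = maximum[name] - loaned[name]
            if name not in finished and still_needed <= available:
                available += loaned[name]      # customer finishes and repays its loan
                finished.add(name)
                break
        else:
            return False                       # nobody can finish: unsafe
    return True

maximum = {"A": 6, "B": 5, "C": 4, "D": 7}     # hypothetical credit lines
loaned  = {"A": 4, "B": 3, "C": 2, "D": 1}     # hypothetical current loans
total   = 15
print(is_safe(total - sum(loaned.values()), loaned, maximum))   # True: everyone can finish

# If D now asked for 4 more units, only 1 unit would be left in the bank and no
# customer could reach their maximum, so the banker would defer the request:
loaned_if_granted = {"A": 4, "B": 3, "C": 2, "D": 5}
print(is_safe(total - sum(loaned_if_granted.values()), loaned_if_granted, maximum))   # False
```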

Similarly, there is a version of the banker's algorithm for multiple resource types, which readers can study on their own.