This article looks at file systems from three angles: 1. How are a file's data blocks recorded and organized? 2. How are disk blocks allocated? 3. How are disk blocks reclaimed?

Disk structure

Principle of the disk

As I mentioned in my last article on operating system memory management, the reasons for the large gap between memory and hard disk speed are as follows:

Why memory is faster than hard disks: memory is faster than a hard disk (not a little faster, but a lot faster) because the two store and read data in fundamentally different ways.

  • Hard disks are mechanical: data is read as the platters spin beneath a head moved by an arm. A typical desktop hard disk spins at 7200 revolutions per minute, a laptop hard disk at 5400.
  • Memory, on the other hand, has no mechanical structure. It accesses data electronically.

Memory accesses data electronically because RAM stores data as electrical charge. The downside is that the data is lost when power is cut. Memory is therefore only a temporary staging area for data, while the hard disk is permanent storage: its data survives power loss.

The magnetic recording principle, in simple terms: a magnetic surface store completes reads and writes through the relative motion of the head and the recording medium. Take reading as an example. The magnetic surface is covered with regions magnetized in different directions. The head is essentially a coil; as it sweeps across the surface it cuts magnetic field lines, and because the field direction differs from region to region, currents of different directions are induced in the coil. These currents are interpreted as 0 and 1 signals, which is how the binary data stored on the disk is read back.

To make this less abstract, let's look at the disk hardware used for file storage and see how the computer stores and retrieves file data.

From Host to Disk: Host -> Disk Controllers -> Disk Drives -> Disks

The disk controller receives commands from the host and converts them into control commands for the disk drive; it also converts data formats and transfers data between the host and the drive, thereby controlling the drive's reads and writes. A computer can attach more than one disk, so there may be multiple disk drives.

Disk, track, sector

The surface of the disk is coated with a magnetic material that records binary data. The round platter is the disk itself, and the head arm swings the head to different tracks. The concept of tracks is as follows:

The disk surface is divided into tracks. Such a “circle” is a track, and a track (a circle) is divided into sectors. Each sector is a “disk block”.

Each sector stores the same amount of data (e.g., 1KB). The sector on the innermost track has the smallest area, so its data density is the largest.

Read the data

You need to move the “head” to the track where the sector you want to read/write is. The disk will spin and allow the target sector to slide under the head before the sector is read/written.

So the whole mechanism is pretty simple, not very complicated.

Cylinder number, disk number, sector number locate a block

The above describes only a single disk surface; in reality many such platters are stacked on top of one another, as follows:

Each disk surface has its own head, which the actuator arm moves inward or outward to reach sectors on different tracks.

Number of heads: equal to the number of recording surfaces. Cylinder number: identifies a track on each surface (the tracks at the same radius on all surfaces form a cylinder). Sector number: identifies a sector within a track.

Any "disk block" can therefore be located by the triple (cylinder number, disk/surface number, sector number), and a block can be read given the block number derived from the file's address.

Reading and writing data process:

  1. Move the magnetic arm according to the “cylinder number”, so that the magnetic head points to the specified cylinder;
  2. Activate the magnetic head corresponding to the specified disk surface;
  3. As the disk rotates, the specified sector is passed under the head, thus completing the reading/writing of the specified sector.
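To make the (cylinder, head, sector) idea concrete, here is a minimal sketch of how a linear block number could be translated into these coordinates, assuming a fixed, made-up geometry (the HEADS and SECTORS_PER_TRACK values and the function name are purely illustrative, not those of any real driver):

#include <stdio.h>

/* Illustrative geometry: the values are made up for the example. */
#define HEADS             18   /* recording surfaces (= heads) */
#define SECTORS_PER_TRACK 63   /* sectors on each track        */

struct chs { int cylinder, head, sector; };

/* Convert a linear block number into (cylinder, head, sector).
 * Sectors are conventionally numbered from 1, cylinders/heads from 0. */
static struct chs block_to_chs(int block)
{
    struct chs addr;
    addr.cylinder = block / (HEADS * SECTORS_PER_TRACK);
    addr.head     = (block / SECTORS_PER_TRACK) % HEADS;
    addr.sector   = block % SECTORS_PER_TRACK + 1;
    return addr;
}

int main(void)
{
    struct chs a = block_to_chs(12345);
    printf("cylinder=%d head=%d sector=%d\n", a.cylinder, a.head, a.sector);
    return 0;
}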

Note that 10 platters have only 18 heads, because the topmost and bottommost surfaces do not record any information and so have no heads.

Disk classification

The records used on old phonographs were very similar to our disks, except that a phonograph has only one head, while a hard disk has two heads per platter, one above and one below, with the platter spinning at high speed between them, as follows:

That is to say, a mechanical hard disk reads and writes data on the upper and lower surfaces of a platter at the same time. Its rotation speed is also far higher than a record's (7200 r/min is common today), so a mechanical disk is very sensitive to shaking and bumps while reading or writing. In addition, because of the high rotation speed, any dust inside will damage the heads or the platters, so the inside of a mechanical hard disk is sealed and it must never be opened outside a dust-free environment.

Disk summary

Disk initialization


  1. Perform low-level formatting (physical formatting) to divide each track of the disk into sectors. A sector can be divided into three parts: header, data area (such as 512B size), and tail. Data structures required for sector management are stored in the header and tail, including sector check codes (such as the parity check and CRC cyclic redundancy check codes, which are used to check whether errors occur in the data in the sector).

  2. Partition the disk; each partition consists of a number of cylinders (this produces the familiar C:, D:, and E: drives)

  3. Create a file system by logical formatting. This includes creating the root directory of the file system, initializing the data structures used for storage space management (such as bitmaps, free partition tables)

Boot block

Computer startup requires a series of initialization work, which is done by executing the initialization program (bootstrappers).

The complete bootstrap program is placed on the boot block (boot block/boot partition) of the disk. The boot block is located at a fixed location on the disk. A disk with a boot partition is called a boot disk or system disk (C: disk).

Only a very small "bootstrap loader" is stored in ROM. At boot time the computer first runs this loader; by executing it, the machine can locate the boot block, read the complete bootstrap program into memory, and finish initialization.

Disk scheduling algorithm

Time required for a disk read/write operation

Seek time TS: the time it takes to move the head to the target track before reading/writing data.

  1. It takes time to activate the head arm. Suppose the time is s;
  2. Moving the head also takes time. Assuming the head moves at a constant speed, each track crossing takes time m, and a total of n tracks must be crossed. Then: seek time TS = s + m × n

Rotational delay TR: the time needed for the disk to rotate until the target sector is under the head. If the rotation speed is r (revolutions per second or per minute), the average delay is TR = (1/2) × (1/r) = 1/(2r).

Transfer time Tt: the time taken to read data from or write data to the disk. Assume the rotation speed is r, the number of bytes to read/write is B, and each track holds N bytes. Then: transfer time Tt = (1/r) × (B/N) = B/(rN).
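Putting the three formulas together, here is a small sketch that computes TS + TR + Tt for a single access. Every parameter value below is an illustrative example, not a real drive specification:

#include <stdio.h>

/* One disk access: seek + rotational delay + transfer.
 * r is expressed in revolutions per millisecond, so all times are in ms. */
int main(void)
{
    double s = 2.0;                   /* arm start-up time, ms                 */
    double m = 0.2;                   /* time to cross one track, ms           */
    double n = 50;                    /* number of tracks to cross             */
    double rpm = 7200;                /* rotation speed                        */
    double r = rpm / 60.0 / 1000.0;   /* revolutions per millisecond           */
    double B = 4096;                  /* bytes to transfer                     */
    double N = 512.0 * 63;            /* bytes per track (63 sectors of 512B)  */

    double Ts = s + m * n;            /* seek time: TS = s + m*n               */
    double Tr = 1.0 / (2.0 * r);      /* average rotational delay: 1/(2r)      */
    double Tt = B / (r * N);          /* transfer time: B/(rN)                 */

    printf("Ts=%.2fms Tr=%.2fms Tt=%.2fms total=%.2fms\n",
           Ts, Tr, Tt, Ts + Tr + Tt);
    return 0;
}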

Review of five replacement algorithms for memory management request paging

In my last article on memory management, I described the five page replacement algorithms used in request paging, as follows:

Memory paging, paging in and out of pages requires disk I/O, which can be expensive, so a good page replacement algorithm should aim for fewer page misses.

For all of the algorithms below (first come first served, shortest seek time first, and the rest), assume the head starts at track 100 and that processes request access to tracks 55, 58, 39, 18, 90, 160, 150, 38, and 184 in that order. Let's see how much seek movement each method requires.

First come, first served algorithm (FCFS)

Scheduling is based on the order in which processes request access to disks.

According to FCFS rules, the magnetic heads start at track 100 and move to track 55, 58, 39, 18, 90, 160, 150, 38, and 184.

  1. 100 - 55 = 45: tracks crossed moving from track 100 to track 55
  2. 58 - 55 = 3: tracks crossed moving from track 55 to track 58
  3. 58 - 39 = 19
  4. 39 - 18 = 21
  5. 90 - 18 = 72
  … and so on for the remaining requests

The head moved a total of 45+3+19+21+72+70+10+112+146 = 498 tracks.

Average movement required to respond to a request is 498/9 = 55.3 tracks (average search length)

Advantages: fair; if the requested tracks are relatively concentrated, the algorithm's performance is acceptable.

Disadvantages: If there are a large number of processes competing to use the disk and the tracks requesting access are scattered, FCFS has poor performance and long seek times.
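As a sanity check on the arithmetic above, here is a small sketch that computes the FCFS total head movement for this request sequence (the function name is mine, just for illustration):

#include <stdio.h>
#include <stdlib.h>

/* FCFS: serve requests strictly in arrival order and add up the head movement. */
static int fcfs_seek_length(int start, const int *req, int n)
{
    int total = 0, pos = start;
    for (int i = 0; i < n; i++) {
        total += abs(req[i] - pos);
        pos = req[i];
    }
    return total;
}

int main(void)
{
    int req[] = {55, 58, 39, 18, 90, 160, 150, 38, 184};
    int n = sizeof(req) / sizeof(req[0]);
    int total = fcfs_seek_length(100, req, n);
    printf("total = %d tracks, average = %.1f\n", total, (double)total / n);
    /* prints: total = 498 tracks, average = 55.3 */
    return 0;
}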

Minimum search time priority (SSTF)

The SSTF algorithm always processes the track closest to the current head position first. It guarantees the shortest seek for each individual step, but it does not guarantee the shortest total seek time. (This is the greedy-algorithm idea: locally optimal at every step, but not necessarily globally optimal.)

Total head movement: (100-18) + (184-18) = 248 tracks; 248/9 ≈ 27.6 tracks per request (average seek length)

Advantages: Good performance, short average seek time

Disadvantages: May produce “hunger” phenomenon

For example: while the request for track 18 is being serviced, a new request for track 38 arrives; while that one is being serviced, another request for track 18 arrives, and so on. If requests for tracks 18 and 38 keep streaming in, the requests for tracks 150, 160, and 184 will never be served, producing the "hunger" (starvation) phenomenon.

Hunger occurs because magnetic heads move back and forth in a small area
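For comparison, a minimal sketch of the SSTF greedy selection on the same request sequence (the helper is illustrative; a real scheduler would work on a dynamic request queue):

#include <stdio.h>
#include <stdlib.h>

/* SSTF: always serve the pending request whose track is closest to the head. */
static int sstf_seek_length(int start, const int *req, int n)
{
    int *done = calloc(n, sizeof(int));   /* 0 = pending, 1 = served */
    int total = 0, pos = start;

    for (int served = 0; served < n; served++) {
        int best = -1, best_dist = 0;
        for (int i = 0; i < n; i++) {
            if (done[i]) continue;
            int d = abs(req[i] - pos);
            if (best == -1 || d < best_dist) { best = i; best_dist = d; }
        }
        done[best] = 1;                   /* greedy: pick the locally closest track */
        total += best_dist;
        pos = req[best];
    }
    free(done);
    return total;
}

int main(void)
{
    int req[] = {55, 58, 39, 18, 90, 160, 150, 38, 184};
    int n = sizeof(req) / sizeof(req[0]);
    printf("SSTF total = %d tracks\n", sstf_seek_length(100, req, n));
    /* prints: SSTF total = 248 tracks */
    return 0;
}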

Scanning algorithm (SCAN)

The SSTF algorithm can starve requests because the head may keep moving back and forth within a small area. To prevent this, SCAN requires the head to keep moving in one direction: it may only turn inward after reaching the outermost track, and only turn outward after reaching the innermost track. This is the idea of the SCAN algorithm. It is also called the elevator algorithm because the head moves much like an elevator.

Assume that the track number of a disk ranges from 0 to 200, the initial position of the magnetic head is track 100, and the magnetic head is moving to the direction with the track number increasing. Multiple processes request access to track 55, 58, 39, 18, 90, 160, 150, 38, and 184 successively.

Total head movement (200-100) + (200-18) = 282 tracks, 282/9 = 31.3 tracks per request on average (average search length)

Advantages: good performance, short average seek time, no hunger phenomenon

Disadvantages:

  1. The head can only change direction after reaching the edge of the disk. In fact, there is no need to keep moving the head to the right once the request for track 184 has been processed.
  2. SCAN does not respond to tracks at different positions with equal frequency (for example, if the head is moving to the right and has just passed track 90, a new request for track 90 must wait for the head to travel a long way there and back, whereas a request for track 184, which was just served, would be served again soon after).
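A small sketch of the SCAN sweep on the same example, assuming the track range 0-200 used above (the head runs all the way to the last track before reversing, which is exactly disadvantage 1):

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* SCAN (elevator): sweep toward larger track numbers until the last track,
 * then reverse and sweep back, serving requests along the way. */
static int scan_seek_length(int start, int last_track, int *req, int n)
{
    qsort(req, n, sizeof(int), cmp_int);

    int total = 0, pos = start;
    /* outward sweep: every request at or beyond the current position */
    for (int i = 0; i < n; i++)
        if (req[i] >= start) { total += req[i] - pos; pos = req[i]; }
    /* SCAN always runs to the edge of the disk before turning around */
    total += last_track - pos;
    pos = last_track;
    /* inward sweep: remaining requests, in decreasing track order */
    for (int i = n - 1; i >= 0; i--)
        if (req[i] < start) { total += pos - req[i]; pos = req[i]; }
    return total;
}

int main(void)
{
    int req[] = {55, 58, 39, 18, 90, 160, 150, 38, 184};
    int n = sizeof(req) / sizeof(req[0]);
    printf("SCAN total = %d tracks\n", scan_seek_length(100, 200, req, n));
    /* prints: SCAN total = 282 tracks */
    return 0;
}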

LOOK scheduling algorithm

In SCAN, the head can only change direction after reaching the edge of the disk; in fact, there is no need to keep moving right once the request for track 184 has been processed. The LOOK scheduling algorithm was designed to solve this: if there are no further requests in the head's direction of travel, the head can reverse direction immediately. (The head "looks" ahead for pending requests as it moves, hence the name LOOK.)

Assume that the track number of a disk ranges from 0 to 200, the initial position of the magnetic head is track 100, and the magnetic head is moving to the direction with the track number increasing. Multiple processes request access to track 55, 58, 39, 18, 90, 160, 150, 38, and 184 successively.

Total head movement: (184-100) + (184-18) = 250 tracks; 250/9 ≈ 27.8 tracks per request (average seek length)

Advantages: compared with SCAN, the head does not need to travel all the way to the innermost or outermost track before changing direction, which shortens the seek time further.

Cyclic scanning algorithm (C-SCAN)

SCAN algorithm has uneven response frequency for each position track, and C-SCAN algorithm is designed to solve this problem. Specifies that track access requests are processed only when the head is moving in a particular direction, and that on return it is quickly moved straight to the beginning without processing any requests.

Assume that the track number of a disk ranges from 0 to 200, the initial position of the magnetic head is track 100, and the magnetic head is moving to the direction with the track number increasing. Multiple processes request access to track 55, 58, 39, 18, 90, 160, 150, 38, and 184 successively

The head moved (200-100) + (200-0) + (90-0)= 390 tracks in total, and 390/9 = 43.3 tracks (average search length) in response to a request.

Advantages: Compared to SCAN, the response frequency for each position track is very even.

Disadvantages: the head can only change direction after reaching the edge of the disk; in fact there is no need to keep moving right once the request for track 184 has been served, and on the return trip the head only needs to go back to track 18, not all the way to the edge. In addition, the average seek time is longer than that of SCAN.

C-look scheduling algorithm

The main disadvantage of C-SCAN is that the head can only change direction after reaching the outermost (or innermost) track, and it does not really need to return all the way to the edge. The C-LOOK algorithm was designed to solve this: if there are no more requests in the head's direction of travel, the head can turn back immediately, and it only needs to return as far as the first pending request rather than the edge of the disk.

Again, the above example:

Total head movement (184-100) + (184-18) + (90-18)= 322 tracks In response to a request 322/9 = 35.8 tracks (average search length)

Advantages: compared with the C-SCAN algorithm, the head does not need to travel all the way to the innermost or outermost track before changing direction, which shortens the seek time further.

File allocation mode

My last article on operating-system memory management also moved step by step from contiguous allocation of memory to non-contiguous paged allocation. History repeats itself: file systems likewise have both contiguous and non-contiguous allocation.

Contiguous allocation

Contiguous allocation requires each file to occupy a contiguous set of blocks on disk, similar to an array.

Features: To read a disk block, the head needs to be moved. The farther apart the accessed disk blocks are, the longer it takes to move the head.

If file A needs to expand, add another disk block (A total of four consecutive disk blocks are required). Because of the contiguous structure, the disk blocks occupied by file A must be contiguous. Therefore, file A can only be “moved” to the green area.

Advantages: Supports sequential access and direct access (i.e. random access); Consecutively allocated files are the fastest when accessed sequentially.

Disadvantages: inconvenient file expansion; The storage space usage is low, causing disk fragmentation.

This advantage and disadvantage can also occur in memory management using this approach, but for memory problems, you can read my last article.

Discontinuous allocation – implicit linking

The discrete disk blocks belonging to the same file are linked into a list through a link pointer stored in each disk block.

Data reading process: the user gives the logical block number i to access, and the operating system finds the directory entry for the file (the FCB, or file control block, described below). The directory entry records the starting block number (logical block 0). Reading logical block 0 into memory reveals the physical block number of logical block 1; reading logical block 1 reveals the location of logical block 2, and so on. Reading logical block i therefore takes a total of i + 1 disk I/Os.

Disadvantages: a file allocated with chained (implicit-link) allocation supports only sequential access, not random access, so lookup efficiency is low. In addition, the pointer to the next disk block consumes a small amount of storage space.

Advantages: to extend the file, simply find a free disk block, attach it to the end of the file's block chain, and update the file's FCB, so extension is easy. In addition, every free disk block can be used, and there is no fragmentation problem.
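A minimal sketch of reading logical block i under implicit linking. The on-disk block layout, the read_block() helper, and the tiny simulated "disk" are assumptions made purely for illustration; the point it shows is why reading logical block i costs i + 1 disk I/Os:

#include <stdio.h>

#define NBLOCKS 8
#define BLOCK_DATA_SIZE 1020   /* illustrative: a 1KB block minus a 4-byte next pointer */

/* Assumed on-disk layout: each block ends with the physical number of the
 * next block of the file (-1 marks the end of the chain). */
struct disk_block {
    char data[BLOCK_DATA_SIZE];
    int  next;
};

static struct disk_block disk[NBLOCKS];   /* simulated disk          */
static int io_count;                      /* counts simulated disk I/Os */

static void read_block(int phys, struct disk_block *out)
{
    *out = disk[phys];
    io_count++;
}

/* Implicit linking: to reach logical block i we must follow the chain from
 * block 0, so reading logical block i costs i + 1 disk I/Os in total. */
static int read_logical_block(int start_block, int i, struct disk_block *out)
{
    int phys = start_block;               /* physical number of logical block 0 (from the FCB) */
    for (int k = 0; k < i; k++) {
        read_block(phys, out);            /* one disk I/O just to learn the next pointer */
        if (out->next < 0) return -1;     /* file is shorter than i + 1 blocks */
        phys = out->next;
    }
    read_block(phys, out);                /* finally read logical block i itself */
    return phys;
}

int main(void)
{
    /* a file scattered over physical blocks 2 -> 5 -> 1 -> 7 */
    disk[2].next = 5; disk[5].next = 1; disk[1].next = 7; disk[7].next = -1;

    struct disk_block buf;
    int phys = read_logical_block(2, 3, &buf);
    printf("logical block 3 is physical block %d, %d disk I/Os\n", phys, io_count);
    /* prints: logical block 3 is physical block 7, 4 disk I/Os */
    return 0;
}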

Discontinuous allocation – explicit linking

The pointers used to link a file's physical blocks are stored explicitly in a table: the File Allocation Table (FAT).

Note: only one FAT is kept per disk. At boot time the FAT is read into memory and stays resident there. The FAT's entries are stored contiguously and all have the same length, so the physical block number field can be implicit (an entry's index is its physical block number).

Reading from disk: the user gives the logical block number i to access, and the operating system finds the file's directory entry (FCB), which records the starting block number. If i > 0, the system walks the in-memory file allocation table (FAT) to find the physical block corresponding to logical block i. Converting a logical block number to a physical block number requires no disk reads.

Advantages over implicit linking: files allocated with explicit (chained) linking support sequential access and also random access (to reach logical block i there is no need to read logical blocks 0 through i-1 first). Because the block-number conversion happens entirely in memory, access is much faster than with implicit linking, and all of implicit linking's other advantages are retained.

The disadvantage is that the file allocation table needs to occupy a certain amount of storage space.
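By contrast, a minimal sketch of the same lookup with explicit linking (FAT): the chain is walked entirely in the in-memory table, so no disk I/O is needed until the data block itself is read. The tiny FAT below is made up for illustration:

#include <stdio.h>

#define FAT_EOF (-1)

/* The FAT is read into memory at boot, so translating a logical block number
 * into a physical block number needs no disk I/O; we just walk the table. */
static int fat_lookup(const int *fat, int start_block, int i)
{
    int phys = start_block;               /* physical block of logical block 0 (from the FCB) */
    for (int k = 0; k < i; k++) {
        phys = fat[phys];                 /* in-memory table lookup, no disk access */
        if (phys == FAT_EOF) return -1;   /* file has fewer than i + 1 blocks */
    }
    return phys;                          /* one disk I/O is now enough to read the data */
}

int main(void)
{
    /* illustrative FAT for a tiny disk of 8 blocks;
     * the file occupies physical blocks 2 -> 5 -> 1 -> 7 */
    int fat[8] = {0};
    fat[2] = 5; fat[5] = 1; fat[1] = 7; fat[7] = FAT_EOF;

    printf("logical block 3 -> physical block %d\n", fat_lookup(fat, 2, 3));
    /* prints: logical block 3 -> physical block 7 */
    return 0;
}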

Index allocation

Index allocation allows files to be distributed among disk blocks discretely. The system creates an index table for each file, which records the physical blocks corresponding to each logical block of the file. (The function of an index table is similar to that of a page table in memory management — mapping between logical pages and physical pages.) Disk blocks that are stored in index tables are called index blocks. The disk blocks where file data is stored are called data blocks.

Actually, once you understand the last article on memory management, file management is easy to grasp: an index table plays much the same role as a page table. Memory management and file management are very similar, because the CPU cannot operate on files in external storage directly; the data must first be read into memory through I/O before it can be manipulated. The more alike the file system and the memory management system are, the more efficient and convenient access becomes, so to reduce complexity, operating system designers deliberately abstract the two into similar management schemes.

Reading data: the user gives the logical block number i to access, and the operating system finds the file's directory entry (FCB), from which it learns the location of the index table. The index table is read from external storage into memory, and looking up entry i in it gives the location of logical block i in external storage.

As you can see, index allocation can support random access. File expansion is easy to implement (just allocate a free block to the file and add an index entry), but index tables take up some storage space.

Index allocation – linked scheme

Link scheme: If the index table is too large for one index block, multiple index blocks can be linked together for storage.

Problem: if the disk block size is 1KB and one index entry occupies 4B, a disk block can hold only 256 index entries. If a file is 256 × 256 KB = 65,536 KB = 64 MB in size, it occupies 256 × 256 blocks, which correspond to 256 × 256 index entries and therefore require 256 index blocks. These index blocks are linked together by pointers.

If you want to access the last logical block of the file, you must reach the last index block (the 256th); because the index blocks are chained by pointers, you must read the preceding 255 index blocks in order first.

Obviously, when the file is very large this is very inefficient, since the chain must be traversed block by block. This leads to the multi-level index described next.

Index allocation – Multi-level index

Multilevel indexes: Create multilevel indexes (similar to multilevel page tables). Make the first level index block point to the second level index block. The third layer and the fourth layer index blocks can also be established according to the requirements of the file size.

Extension: A UNIX system uses 1KB disk blocks and 4-byte disk addresses. What is the maximum size of a file if each I node has 10 direct entries and one primary indirect block, one secondary indirect block, and one tertiary indirect block?

A: With 1KB blocks and 4-byte addresses, each index block holds 256 addresses. For each i-node, the 10 direct entries cover 10 disk blocks, the single indirect block covers 256 blocks, the double indirect block covers 256^2 blocks, and the triple indirect block covers 256^3 blocks. The maximum file size is therefore (10 + 2^8 + 2^16 + 2^24) × 1KB ≈ 16GB.
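For reference, a small sketch that redoes this calculation under the same assumptions (1KB blocks, 4-byte addresses, 10 direct entries, one single, one double, and one triple indirect block):

#include <stdio.h>

int main(void)
{
    unsigned long long block_size = 1024;                 /* bytes            */
    unsigned long long addrs_per_block = block_size / 4;  /* 256 addresses    */

    unsigned long long blocks = 10                                /* direct          */
                              + addrs_per_block                   /* single indirect */
                              + addrs_per_block * addrs_per_block /* double indirect */
                              + addrs_per_block * addrs_per_block * addrs_per_block; /* triple indirect */

    printf("max file size = %llu KB (about %llu GB)\n",
           blocks, blocks * block_size / (1024 * 1024 * 1024));
    /* prints: max file size = 16843018 KB (about 16 GB) */
    return 0;
}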

Index allocation – Mixed indexes

Mixed index: a combination of multiple index allocation methods. For example, a file’s top-level index table contains both direct address indexes (pointing directly to data blocks), one-level indirect indexes (pointing to single-level index tables), and two-level indirect indexes (pointing to two-level index tables).

Summary

File storage space management

I’ve spent a lot of time talking about file allocation, so let’s talk about recycling and managing free blocks.

How to manage the free block here?

Partition disks (C: disk, D: disk, E: disk, etc.)

The free list method

Applicable to the contiguous allocation mode.

If a file is deleted, the system reclaims blocks 15, 16, and 17

Free linked list method

Free disk block chain:

The operating system holds the pointer to the head and tail of the chain.

Allocation: If a file requests K disk blocks, remove K disk blocks from the chain head and modify the header pointer of the free chain.

Recycle: The recovered disk blocks are attached to the end of the chain in turn, and modify the end pointer of the free chain.

Free zone chain:

The operating system holds the pointer to the head and tail of the chain.

Allocation: if a file requests K disk blocks, a first-fit or best-fit algorithm can be used to search from the head of the chain and, following the algorithm's rule, find a free extent of the required size to allocate to the file. If no single free extent is large enough, blocks from several different free extents can be allocated to the same file. Note that data such as the chain pointers and extent sizes may need to be updated after allocation.

Reclamation: If the reclamation area is adjacent to a free disk area, you need to merge the reclamation area into the free disk area. If the recycle zone is not adjacent to any free zone, the recycle zone is attached to the end of the chain as a separate free zone.

Bitmap method

Applicable to both contiguous and discrete allocation.

Note: Each binary bit corresponds to a disk block. 0 indicates that a disk block is free, and 1 indicates that a disk block is allocated.

Allocation: if the file needs K blocks, first scan the bitmap in order to find K free bits ("0"s), adjacent or not; then compute the corresponding disk block numbers from the word numbers and bit numbers and allocate those blocks to the file; finally set the corresponding bits to "1".

Reclamation: compute the word number and bit number from the number of each reclaimed disk block; then set the corresponding bits back to "0".
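A minimal sketch of bitmap allocation and reclamation for a toy disk; the bitmap size and the helper names are illustrative:

#include <stdio.h>

#define NBLOCKS 64                        /* illustrative disk size in blocks */

static unsigned char bitmap[NBLOCKS / 8]; /* 1 bit per block: 0 = free, 1 = allocated */

static int  test_bit(int b)  { return (bitmap[b / 8] >> (b % 8)) & 1; }
static void set_bit(int b)   { bitmap[b / 8] |=  (1 << (b % 8)); }
static void clear_bit(int b) { bitmap[b / 8] &= ~(1 << (b % 8)); }

/* Allocation: scan the bitmap for a free block, mark it used, return its number. */
static int alloc_block(void)
{
    for (int b = 0; b < NBLOCKS; b++)
        if (!test_bit(b)) { set_bit(b); return b; }
    return -1;                            /* disk full */
}

/* Reclamation: given a block number, simply clear its bit. */
static void free_block(int b)
{
    clear_bit(b);
}

int main(void)
{
    int a = alloc_block(), b = alloc_block(), c = alloc_block();
    free_block(b);                        /* block b becomes free again            */
    int d = alloc_block();                /* and is handed out on the next request */
    printf("a=%d b=%d c=%d d=%d\n", a, b, c, d);  /* prints: a=0 b=1 c=2 d=1 */
    return 0;
}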

Group chaining

The free list method or free linked list method is not suitable for large file systems because the free table or free linked list can be too large.

In UNIX system, the group link method is used to manage the free disk blocks.

In the directory area of the file volume, one disk block is used as the superblock. The superblock must be read into memory when the system starts, and the in-memory copy must be kept consistent with the superblock data in external storage.

Allocation: 100 free blocks are required

  1. Check that the number of blocks in the first group is sufficient. 100 is 100, that’s enough.
  2. Allocate 100 free blocks in the first group. But because block 300 contains the next set of information, the data in block 300 needs to be copied to the superblock.

Reclaim: Assume that each group has a maximum of 100 free blocks. At this point, the first group already has 99 blocks and one more block needs to be reclaimed.

You need to copy the data from the super block to the newly reclaimed block and modify the contents of the super block so that the newly reclaimed block becomes the first grouping.

File classification

Classified by structure, files can be divided into unstructured files and structured files.

Unstructured files

The data inside the file is simply a stream of binary or characters, which is why these are also called "stream files". An example is a .txt file in the Windows operating system.

Structured file

A set of similar records, also known as a "record file". Each record consists of several data items. For example, a database table file: what a MySQL table stores is structured data.

File directory

File control block

The operating system uses a process control block (PCB) for process management; similarly, it introduces a file control block (FCB) for file management. As shown in the figure above, an FCB is a directory entry.

Directory operations:

  1. Search: When a user wants to use a file, the system searches the directory based on the file name and finds the directory entry corresponding to the file
  2. Creating a file: When creating a new file, you need to add a directory entry to its owning directory
  3. Deleting a file: When deleting a file, you need to delete the corresponding directory entry from the directory
  4. Display directory: The user can request to display the contents of the directory, such as display all files and corresponding properties in the directory
  5. Modify directory: Some file attributes are stored in directories, so when these attributes change, you need to modify the corresponding directory entry (for example, file renaming).

When we double-click "e-books", the operating system finds the directory entry (record) whose key is "e-books" in the current directory table, then reads the "e-books" directory information from external storage into memory, so that the contents of the "e-books" directory can be displayed.

Hierarchy of the file system

The process is as follows: Suppose a user requests to delete the last 100 records of the file “D:/ working directory/student information.xlsx”.

  1. The user needs to make this request through an interface provided by the operating system – the user interface

  2. The user provides the path where the file is stored, so the operating system needs to search layer by layer for the corresponding directory entry (the file directory system)

  3. Different users have different operation permissions on files, so in order to ensure security, it is necessary to check whether users have access permissions — Access control module (Access control verification layer)

  4. Once the user’s access rights are verified, the record number provided by the user needs to be converted into the corresponding logical address — the logical file system and file information buffer

  5. Once you know the logical address of the target record, you need to translate it to the actual physical address — the physical file system

  6. To delete this record, a request must be made to the disk device – device Manager module

  7. After deleting these records, there will be some free disk blocks, so these free disk blocks will be reclaimed – the auxiliary allocation module

The directory structure

Single-level directory structure

Early operating systems did not support multi-level directories, and only one directory table was established in the entire system, with each file holding one directory entry.

Single-level directories implement "access by name", but they do not allow duplicate file names. Before creating a file, the directory table is checked for a file with the same name, and only then is the new file's directory entry inserted into the table.

Obviously, the single-level directory structure is not suitable for multi-user operating systems.

A two-level directory structure

Early multiuser operating systems used a two-level directory structure. It is divided into Master File Directory (MFD) and User File Directory (UFD).

Multi-level directory structure

Also called tree structure.

When a user (or user process) wants to access a file, the file is identified by the file pathname, which is a string. Directories at different levels are separated by slash (/). The path from the root directory is called an absolute path.

For example, the absolute path of selfie.jpg is /photo/2015-08/selfie.jpg. The system follows the absolute path level by level: it first reads the root directory table from external storage; after finding the location of the "photo" directory it reads that directory's table from external storage; it then locates the "2015-08" directory and reads its table from external storage, and finally it finds the location of the file "selfie.jpg". The whole process requires three disk I/O reads.

Many times, users will access multiple files in the same directory consecutively (for example, to view multiple photo files in the “2015-08” directory consecutively). Obviously, it is inefficient to start from the root directory each time. So you can set a “current directory”.

For example, if the "photo" directory has already been opened, that is, its directory table has already been read into memory, it can be set as the current directory. When a user then wants to access a file, a "relative path" from the current directory can be used. In Linux, "." denotes the current directory, so if "photo" is the current directory, the relative path of selfie.jpg is ./2015-08/selfie.jpg. Starting from the current directory, the system only needs to consult the in-memory "photo" directory table to learn the location of the "2015-08" directory table, and a single read from external storage then reveals the location of selfie.jpg. With a current directory and relative paths, the number of disk I/Os is reduced and file access efficiency improves.

(As an aside, MySQL indexes use a B+ tree structure rather than a red-black tree.)

Advantages and disadvantages: the tree directory structure makes it easy to classify files, has a clear hierarchical structure, and can manage and protect files effectively. However, the tree structure is not convenient for file sharing, which is why the "acyclic-graph directory structure" was proposed.

Acyclic graph directory structure

You can point to the same file with different file names, or even point to the same directory (sharing everything in the same directory).

You need to set a share counter for each shared node to keep track of how many places are sharing the node at this point. When a user makes a request to delete a node, only the FCB of the user is deleted and the shared counter is reduced by 1, but the shared node is not directly deleted. The node is deleted only if the shared counter is reduced to 0.

Note: Sharing a file is different from copying a file. In a shared file, each user points to the same file, so if one user modifies the file data, all users can see the file data changes.

Index nodes: an improvement on the FCB

All file description information except the file name is moved into an index node (inode); the directory entry then holds only the file name and a pointer to the inode.

Assume an FCB is 64B and the disk block size is 1KB, so each disk block can hold only 16 FCBs. If a file directory contains 640 entries, 40 disk blocks are needed in total. Retrieving a file by name therefore examines 320 directory entries on average, which means the disk must be accessed about 20 times on average (each block read is one disk I/O).

If the index-node mechanism is used instead, the file name occupies 14B and the inode pointer occupies 2B, so each disk block can hold 64 directory entries. Retrieving a file by name then requires reading only 320/64 = 5 disk blocks on average. Obviously, this greatly speeds up file retrieval.
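A small sketch that redoes both calculations with the same numbers (640 entries, 1KB blocks, 64B FCB versus 14B name plus 2B inode pointer):

#include <stdio.h>

int main(void)
{
    int entries = 640, block = 1024;

    int full_fcb = 64;                       /* full FCB stored in the directory */
    int per_block1 = block / full_fcb;       /* 16 entries per block             */
    printf("full FCB:  %d blocks, ~%d reads on average\n",
           entries / per_block1, (entries / 2) / per_block1);   /* 40 blocks, 20 reads */

    int slim_entry = 14 + 2;                 /* file name + inode pointer        */
    int per_block2 = block / slim_entry;     /* 64 entries per block             */
    printf("inode dir: %d blocks, ~%d reads on average\n",
           entries / per_block2, (entries / 2) / per_block2);   /* 10 blocks, 5 reads */
    return 0;
}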

Virtual file system (Linux file system source code implementation)

The following code is an example of a Linux file system.

The disks and CD-ROMs we use are block devices, that is, they are read and written in units of data blocks. You can think of a disk or CD-ROM as a huge array of data blocks. However, this way of reading and writing is not very human-friendly, so a file system is usually mounted on the disk or CD-ROM before use. So what is a file system? A file system is a way of storing and organizing data that makes the data easy to access and find. After mounting a file system, we can access the data on the disk with a path such as /home/docs/test.txt instead of a block number.

On Linux, different devices can be mounted with different file systems, such as ext2, ext3, NFS, and so on. However, the file-handling interfaces exposed to users are identical: whether the ext2 or the ext3 file system is used, files are processed through the same interfaces. The advantage is that users do not need to care which file system is in use and can handle files in a uniform way. So how does Linux achieve this? Thanks to the Virtual File System (VFS).

Virtual file system: defines a set of specifications for different file systems. Each file system can be connected to the virtual file system only after being written according to the specifications of the virtual file system. This is a bit like the “interface” in object-oriented languages, when a class implements all the methods of an interface, it can be considered as that interface. VFS is primarily a bridge between the user and the kernel, allowing users to access different file systems through an interface provided by VFS.

NFS file system: The NFS file system used by our company is a distributed network file system. In cluster mode, each machine can access data only by mounting NFS.

Here is a list of NFS disk usage in our company:

The data structure of a virtual file system

Because VFS defines a uniform interface layer for different types of file systems, it also defines a set of specifications that real file systems must follow when they are implemented. VFS abstracts several data structures to organize and manage different file systems: super_block, inode, dentry, and file. Understanding the definition and role of these data structures is essential to understanding VFS.

Super block

Because Linux supports multiple file systems, the kernel must describe each concrete file system and its related operations through a data structure. VFS therefore defines a data structure called the super block (super_block) to describe a concrete file system; in other words, the kernel recognizes a concrete file system through its superblock. Each concrete file system corresponds to one superblock structure, defined as follows (super_block has many members, so only some are listed here):

struct file_system_type {
    const char *name;
    int fs_flags;
    struct super_block *(*read_super) (struct super_block *, void *, int); // how to read this file system's superblock from the device
    ...
};

struct super_operations {
    void (*read_inode) (struct inode *);
    void (*write_inode) (struct inode *, int);
    void (*put_inode) (struct inode *);
    void (*delete_inode) (struct inode *);
    void (*put_super) (struct super_block *);
    void (*write_super) (struct super_block *);  // write the superblock back to disk
    ...
};

struct super_block {
    struct list_head s_list;            /* Keep this first */
    kdev_t s_dev;                       // device number
    unsigned long s_blocksize;          // block size used by the file system
    unsigned char s_blocksize_bits;
    unsigned char s_lock;
    unsigned char s_dirt;
    struct file_system_type *s_type;    // file system type
    struct super_operations *s_op;      // superblock-related operations
    struct dquot_operations *dq_op;
    unsigned long s_flags;
    unsigned long s_magic;
    struct dentry *s_root;              // root directory of the mount
    wait_queue_head_t s_wait;
    struct list_head s_dirty;           /* dirty inodes */
    struct list_head s_files;
    struct block_device *s_bdev;
    struct list_head s_mounts;
    struct quota_mount_options s_dquot;
    union {
        struct minix_sb_info minix_sb;
        struct ext2_sb_info ext2_sb;
        ...
    } u;
    ...
};

Some key members:

  • s_dev: the device number
  • s_blocksize: the data block size used by the file system (file systems operate in units of blocks)
  • s_type: the type of the file system (provides the method for reading this file system's superblock from the device)
  • s_op: the list of superblock-related operations
  • s_root: the root directory of the mount

Index node (inode)

Just as the superblock (super_block) describes a concrete file system, VFS defines an index node (inode) structure to describe a file. It contains information about the file such as its size, owner, creation time, and location on disk. Every file has a corresponding inode structure.

The definition of an inode is as follows (since there are many inode members, only some of them are listed here, please refer to the Linux source code for details) :

struct inode_operations {
    int (*create) (struct inode *, struct dentry *, int);
    struct dentry * (*lookup) (struct inode *, struct dentry *);
    int (*link) (struct dentry *, struct inode *, struct dentry *);
    int (*unlink) (struct inode *, struct dentry *);
    int (*symlink) (struct inode *, struct dentry *, const char *);
    ...
};

struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
    ...
};

struct inode {
    ...
    unsigned long i_ino;
    atomic_t i_count;
    kdev_t i_dev;
    umode_t i_mode;
    nlink_t i_nlink;
    uid_t i_uid;
    gid_t i_gid;
    kdev_t i_rdev;
    loff_t i_size;
    time_t i_atime;
    time_t i_mtime;
    time_t i_ctime;
    ...
    struct inode_operations *i_op;
    struct file_operations *i_fop;
    struct super_block *i_sb;
    ...
    union {
        struct minix_inode_info minix_i;
        struct ext2_inode_info ext2_i;
        ...
    } u;
};

Several important members of the inode:

  • i_uid: the user that owns the file
  • i_gid: the group that owns the file
  • i_rdev: the number of the device on which the file resides
  • i_size: the size of the file
  • i_atime: the time the file was last accessed
  • i_mtime: the time the file was last modified
  • i_ctime: the time the file was created
  • i_op: the list of inode-related operations
  • i_fop: the list of file-related operations
  • i_sb: the superblock of the file system on which the file resides

We should focus on the i_op and i_fop members. The i_op member defines the list of directory-related operations: for example, the mkdir() system call ends up invoking the inode->i_op->mkdir() method, and the link() system call invokes the inode->i_op->link() method. The i_fop member defines the list of operations on the file once it is opened: for example, the read() system call triggers the inode->i_fop->read() method and the write() system call triggers the inode->i_fop->write() method.

Directory entry (dentry)

A directory entry is a data structure maintained by the kernel that is not stored on disk but cached in memory.

The main purpose of a directory entry is to make file lookup convenient. Each component of a path, whether a directory or a regular file, is a directory entry object. For example, in the path /home/liexusong/example.c, the directories /, home/ and liexusong/ and the file example.c each correspond to a directory entry object. Unlike the previous two objects, directory entry objects have no corresponding on-disk structure; the VFS constructs them on the fly as it resolves path names. The definition is as follows:

struct dentry_operations {
    int (*d_revalidate)(struct dentry *, int);
    int (*d_hash) (struct dentry *, struct qstr *);
    int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
    int (*d_delete)(struct dentry *);
    void (*d_release)(struct dentry *);
    void (*d_iput)(struct dentry *, struct inode *);
};

struct dentry {
    ...
    struct inode * d_inode;             // inode this directory entry points to
    struct dentry * d_parent;           // parent of this directory entry
    ...
    struct qstr d_name;                 // name of the directory entry
    unsigned long d_time;
    struct dentry_operations *d_op;
    struct super_block * d_sb;          // superblock object of the file system
    ...
    unsigned char d_iname[DNAME_INLINE_LEN]; // used when the name is no longer than 16 characters
};

File structure (File)

The file structure is used to describe an open file, including the current read/write offset of the file, the file opening mode, and the list of file operation functions. The file structure is defined as follows:

struct file {
    struct list_head f_list;
    struct dentry *f_dentry;            // directory entry associated with the file
    struct file_operations *f_op;       // list of file operations
    atomic_t f_count;                   // reference count (how many users have the file open)
    unsigned int f_flags;               // open flags
    mode_t f_mode;                      // open mode
    loff_t f_pos;                       // current read/write offset
    unsigned long f_reada, f_ramax, f_raend, f_ralen, f_rawin;
    struct fown_struct f_owner;
    unsigned int f_uid, f_gid;          // uid and gid of the opener
    int f_error;
    unsigned long f_version;

    /* needed for tty driver, and maybe others */
    void *private_data;
};

A table of relationships between data structures

Virtual file system implementation

This section covers quite a few concrete implementation details; feel free to skim or skip it.

  1. Registering a File system

Linux created the virtual file system to support different file systems. The VFS is more like a specification (or interface): a real file system must implement the VFS's specification (interface) before it can be plugged into the Linux kernel.

To enable the Linux kernel to discover the actual filesystem, the register_filesystem() function must first be used to register the filesystem.

register_filesystem() is implemented very simply: it adds an fs of type struct file_system_type to the global file_systems list.

When installing a Linux operating system, you need to format the disk with a specific file system; formatting is essentially writing that file system's superblock information to the disk. When Linux boots, it iterates over all registered file systems and calls each one's read_super() interface to try to read the superblock. Since every file system's superblock carries a different magic number that identifies it, a successful read_super() call means the superblock was read and the file system used by the disk has been identified.

After the superblock is read successfully, the dentry structure of the root directory is saved in the root and pwd fields of the current process: root is the root directory and pwd is the current working directory.

Open the file

To use a file, you must open the file, which is done using the open() system call, which ultimately calls the kernel’s sys_open() function.

The main flow of the sys_open() function is:

  • Get a free file descriptor by calling get_unused_fd().
  • Call filp_open() to open the file and return the file structure of the open file.
  • Call the fd_install() function to associate the file descriptor with the file structure.
  • Returns the file descriptor, which is the return value of the open() system call.

A new file is ultimately created by calling the create() method of the inode structure. This method is provided by the real file system, so the real file system only needs to attach its own file-creation method to the inode structure; the virtual file system does not need to know how the real file system implements it. This is the real reason the virtual file system can support many different file systems.

Read and write files

Reading the contents of the file is done by the read() system call, which ultimately calls the sys_read() kernel function.

The sys_read() function first calls the fget() function to convert the file descriptor into a file structure, and then reads the contents of the file by calling the read() method of the file structure, which is provided by the real file system. So the final process will vary depending on the file system, such as ext2 file systems that will eventually call generic_file_read() to read the contents of the file.

Writing content to a file is done by calling the write() system call, which ultimately calls the sys_write() kernel function.

The sys_write() function is similar to sys_read(): it first calls fget() to convert the file descriptor into a file structure, and then writes the content by calling the write() method of the file structure. For ext2 file systems, the write() method corresponds to the ext2_file_write() function.
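To tie the kernel path back to user space, here is a minimal user-space sketch that exercises exactly these routes: open() ends up in sys_open(), write() in sys_write() and then the real file system's write method, and read() in sys_read() and the read method. The file path is just an example:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* open() traps into the kernel and ends up in sys_open(),
     * which returns a file descriptor bound to a struct file. */
    int fd = open("/tmp/vfs-demo.txt", O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* write() -> sys_write() -> the write() method of the real file system */
    const char *msg = "hello vfs\n";
    if (write(fd, msg, strlen(msg)) < 0) perror("write");

    /* read() -> sys_read() -> the read() method of the real file system */
    char buf[32] = {0};
    lseek(fd, 0, SEEK_SET);              /* rewind f_pos before reading back */
    if (read(fd, buf, sizeof(buf) - 1) > 0)
        printf("read back: %s", buf);

    close(fd);                           /* releases the struct file when its count drops to 0 */
    return 0;
}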

Cache I/O && Direct I/O

Cache I/O was introduced to reduce I/O operations on block devices, but since read and write operations are cached first and then copied from cache to user space, there is an additional memory copy operation.

Caching I/O has the advantage of reducing I/O operations on block devices, but has the disadvantage of requiring one more memory copy. In addition, some applications that need to manage the I/O cache themselves (such as database systems) may need to use direct I/O.

The red box in the figure above shows the cache I/O location, located between the virtual file system and the real file system.

That is, when the virtual file system reads a file, it first checks the cache to see if the contents of the file to be read are in the cache, and if so, it reads directly from the cache. The same is true for writing files, which are first written to the cache and then synchronized by the operating system to a block device, such as a disk.

In fact, memory management faces an analogous problem: to bridge the speed mismatch between the CPU and main memory, hardware introduces a multi-level (L1/L2/L3) cache; and to solve the data-inconsistency problems that arise when each core of a multi-core CPU has its own cache, the MESI cache-coherence protocol was introduced, which is the foundation of highly concurrent memory access.

Docker image implementation

Docker's underlying troika consists of Namespace, CGroup, and UnionFS (the union file system), and UnionFS is the basis of Docker images. (These three Linux features, often regarded as some of Linux's most elegant designs, are the foundation on which Docker and later Kubernetes were built.)

UnionFS is a layered, lightweight, high-performance file system. It allows changes to the file system to be stacked layer upon layer, one commit at a time, while mounting different directories under the same virtual file system.

UnionFS is the basis of Docker images. Images can be inherited layer by layer: starting from a base image (one with no parent image), various concrete application images can be built. Since there are several UnionFS variants on Linux (such as AUFS, OverlayFS, and Btrfs), we take OverlayFS, which is relatively simple to implement, as the object of analysis.

UnionFS-OverlayFS

As Figure 1 shows, the OverlayFS file system has three roles: lowerdir, upperdir, and merged.

  • Lowerdir is a read-only layer and files in this layer cannot be modified.
  • Upperdir is the read-write layer where you can modify files.
  • Merged is a merged layer that displays merged files from the lowerdir and upperdir layers.

You need to mount the OverlayFS file system before using it. The basic commands for mounting the OverlayFS file system are as follows:

$ mount -t overlay overlay -o lowerdir=lower1:lower2,upperdir=upper,workdir=work merged

Here, overlay indicates the OverlayFS file system type; -o specifies the lowerdir, upperdir, and workdir directories; and the merged directory is the final mount point.

Implementation principle of OverlayFS

The function of the OverlayFS file system is to merge the contents of the upper directory and the lower directory. If the upper directory and the lower directory exist in the same file or directory, what does the OverlayFS file system do?

  • If the same file exists in the upper and lower directories, the file in the upper directory prevails. For example, if the a.txt file exists in both the upper and lower directories, the a.txt file in the upper directory prevails.

  • If both upper and lower directories exist in the same directory, merge the contents of the upper and lower directories. For example, if the test directory exists in both the upper and lower directories, merge the contents in the test directory in the upper directory and the test directory in the lower directory.

The concrete implementation is as follows:

When the ovl_dir_read() function is called to read the list of files in the lower and upper directories, the ovl_fill_merge() function is called to filter the same files. The filtering operation is realized by red-black tree, and the filtering process is as follows:

Read the list of files in the upper directory, save to the list list, and save to the red-black tree.

Read the list of files in the lower directory and check whether the file already exists in the red-black tree. If so, skip the file, otherwise add it to the list.

Conclusion

Linux follows the Unix philosophy that everything is a file, such as a directory is a file, a device is a file, and so file systems play a very important role in Linux.

This article draws on teacher Wang Dao's operating-system lectures, as well as github.com/ZhiQinZeng/… The Linux source-code analysis is also material I prepared for an internal sharing session at my company.

I am very glad to meet you through learning like this. I am Forrest Gump, the author of "Forrest Gump's Code Road", a programmer girl committed to sharing back-end fundamentals.