1. Introduction

In this article, we analyze how PostgreSQL determines the maximum number of files that a process can open at the same time, and how the database manages the file descriptors it uses.

2. Start with VFD

A PostgreSQL server opens many file descriptors for a variety of reasons, including base tables, temporary files (such as sort and hash spool files), and incidental calls to C library routines such as system(3). The number of files a single process has open can therefore easily exceed the system limit (around 1024 on many modern operating systems, and possibly lower on others).

To manage file handles in a unified way, PostgreSQL introduces the virtual file descriptor (VFD). A VFD stands in for the file descriptor allocated by the kernel: when a process needs to open a file, the VFD layer always returns a valid, usable handle, while the details and the corresponding decision logic stay hidden inside that layer.

Essentially, the number of files PostgreSQL can have open is still dictated by the operating system; it is only the internal VFD implementation that gives processes the impression that this small number of descriptors is unlimited. Processes do not operate on files (in UNIX everything is a file, so this includes regular files, directories, and so on) by calling system functions such as open, read, write, seek, and sync directly; they go through the VFD layer instead. The VFD layer performs a series of checks and finally obtains the kernel file descriptor fd of the file to be processed. However, the handle returned to the process is not the real fd allocated by the kernel but a virtual file descriptor produced by the VFD layer's internal mapping. In fact, this virtual handle is the index, in the VfdCache array (the VFD pool), of the entry that holds the real fd.

In terms of the Linux kernel architecture, the VFD layer lives in the application layer, above the system calls (open, read, and so on), as shown in the following figure:

That is, PostgreSQL processes call the VFD layer directly, and the VFD layer wraps the system functions internally. When a process wants to work with a file, it calls the family of VFD wrapper functions.

2.1 VFD data structure

PostgreSQL manages an LRU (Least Recently Used) cache pool of VFDs, built on a struct data type called Vfd. So, before we get into LRU pool management, let’s get familiar with the declaration of the Vfd structure.

Every time a PostgreSQL process opens a file through this layer, a Vfd structure is assigned to it. The Vfd type is declared in src/backend/storage/file/fd.c as follows:

typedef struct vfd
{
  int        fd;          /* current FD, or VFD_CLOSED if none */
  unsigned short   fdstate;      /* bitflags for VFD's state */
  ResourceOwner   resowner;      /* owner, for automatic cleanup */
  File      nextFree;      /* link to next free VFD, if in freelist */
  File      lruMoreRecently;  /* doubly linked recency-of-use list */
  File      lruLessRecently;
  off_t      fileSize;      /* current size of file (0 if not temporary) */
  char         *fileName;      /* name of file, or NULL for unused VFD */
  /* NB: fileName is malloc'd, and must be free'd when closing the VFD */
  int        fileFlags;      /* open(2) flags for (re)opening the file */
  mode_t      fileMode;      /* mode to pass to open(2) */
} Vfd;

In PostgreSQL 13.2, the structure data type has a total of 10 data members. The functions that each member serves are described below, which will help you understand the LRU pool logic in the following steps.

  • fd

fd is the actual file descriptor allocated by the kernel that the current VFD corresponds to. If the VFD has no file open (that is, no kernel file descriptor), the value is VFD_CLOSED, which is -1. The macro is declared as follows:

#define VFD_CLOSED (-1)
  • fdstate

Records the status flag bits of the VFD. In version 13.2, there are three status flags: FD_DELETE_AT_CLOSE, FD_CLOSE_AT_EOXACT, and FD_TEMP_FILE_LIMIT.

They are declared as follows:

#define FD_DELETE_AT_CLOSE  (1 << 0)  /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT  (1 << 1)  /* T = close at eoXact */
#define FD_TEMP_FILE_LIMIT  (1 << 2)  /* T = respect temp_file_limit */

The PostgreSQL version is called out here because the flag bits of this member differ between versions. For example, in 9.6.7 the flag bits are declared as follows:

/* these are the assigned bits in fdstate below: */
#define FD_TEMPORARY       (1 << 0)  /* T = delete when closed */
#define FD_XACT_TEMPORARY  (1 << 1)  /* T = delete at eoXact */

Both the macro names and the set of flags differ between the two versions. In 13.2 the flags mean the following:

  • FD_DELETE_AT_CLOSE

If bit 0 (the lowest bit) of fdstate is 1, the file is deleted when it is closed.

  • FD_CLOSE_AT_EOXACT

If bit 1 of fdstate is 1, the file is closed at end of transaction (eoXact).

  • FD_TEMP_FILE_LIMIT

If bit 2 of fdstate is 1, the file is subject to the temp_file_limit setting.
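
As a minimal sketch of how such bit flags are combined and tested, the following uses the same bit layout as the 13.2 definitions quoted above. The helpers make_temp_fdstate() and fdstate_has() are hypothetical names introduced for illustration; they are not functions from fd.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Same bit layout as the 13.2 definitions quoted above. */
#define FD_DELETE_AT_CLOSE  (1 << 0)  /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT  (1 << 1)  /* T = close at eoXact */
#define FD_TEMP_FILE_LIMIT  (1 << 2)  /* T = respect temp_file_limit */

/* Hypothetical helper: build the fdstate value for a temporary file. */
static unsigned short
make_temp_fdstate(bool close_at_eoxact)
{
    unsigned short fdstate = FD_DELETE_AT_CLOSE | FD_TEMP_FILE_LIMIT;

    if (close_at_eoxact)
        fdstate |= FD_CLOSE_AT_EOXACT;
    return fdstate;
}

/* Hypothetical helper: report whether one flag bit is set. */
static bool
fdstate_has(unsigned short fdstate, unsigned short flag)
{
    return (fdstate & flag) != 0;
}
```

Because every flag occupies its own bit, any combination fits in the single unsigned short member, and each flag can be tested independently with a bitwise AND.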

  • resowner

Records the resource owner, which is used for automatic cleanup. The member is a pointer to the following structure type:

typedef struct ResourceOwnerData
{
  ResourceOwner parent;    /* NULL if no parent (toplevel owner) */
  ResourceOwner firstchild;  /* head of linked list of children */
  ResourceOwner nextchild;  /* next child of same parent */
  const char *name;      /* name (just for debugging) */

  /* We have built-in support for remembering: */
  ResourceArray bufferarr;  /* owned buffers */
  ResourceArray catrefarr;  /* catcache references */
  ResourceArray catlistrefarr;  /* catcache-list pins */
  ResourceArray relrefarr;  /* relcache references */
  ResourceArray planrefarr;  /* plancache references */
  ResourceArray tupdescarr;  /* tupdesc references */
  ResourceArray snapshotarr;  /* snapshot references */
  ResourceArray filearr;    /* open temporary files */
  ResourceArray dsmarr;    /* dynamic shmem segments */
  ResourceArray jitarr;    /* JIT contexts */

  /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
  int      nlocks;      /* number of owned locks */
  LOCALLOCK  *locks[MAX_RESOWNER_LOCKS];  /* list of owned locks */
}      ResourceOwnerData;


The members of this structure record snapshots, dynamic shared memory segments, buffers, cache references, and so on. A separate section will be devoted to the resowner member of Vfd.

  • nextFree

Points to the next free VFD. The nextFree member has the data type File. Note that File in fd.c is not the FILE stream type of the C library. Here, File is an alias for int and represents the subscript of a Vfd in the VfdCache array, as follows:

typedef int File;

For the FILE type in the C library, the structure type is declared as follows (the structure type declaration comes from Glibc V2.31).

typedef struct _IO_FILE FILE;

struct _IO_FILE
{
  int _flags;    /* High-order word is _IO_MAGIC; rest is flags. */

  /* The following pointers correspond to the C++ streambuf protocol. */
  char *_IO_read_ptr;  /* Current read pointer */
  char *_IO_read_end;  /* End of get area. */
  char *_IO_read_base;  /* Start of putback+get area. */
  char *_IO_write_base;  /* Start of put area. */
  char *_IO_write_ptr;  /* Current put pointer. */
  char *_IO_write_end;  /* End of put area. */
  char *_IO_buf_base;  /* Start of reserve area. */
  char *_IO_buf_end;  /* End of reserve area. */

  /* The following fields are used to support backing up and undo. */
  char *_IO_save_base; /* Pointer to start of non-current get area. */
  char *_IO_backup_base;  /* Pointer to first valid character of backup area */
  char *_IO_save_end; /* Pointer to end of non-current get area. */

  struct _IO_marker *_markers;

  struct _IO_FILE *_chain;

  int _fileno;
  int _flags2;
  __off_t _old_offset; /* This used to be _offset but it's too small.  */

  /* 1+column number of pbase(); 0 is unknown. */
  unsigned short _cur_column;
  signed char _vtable_offset;
  char _shortbuf[1];

  _IO_lock_t *_lock;
  /* Excerpt ends here; unless _IO_USE_OLD_IO_FILE is defined, glibc declares further members. */
};

How should “pointing to the next free VFD” be understood? See Section 3.3.

  • lruMoreRecently

Holds the subscript of the virtual file descriptor in the LRU pool that was used more recently than this VFD.

  • lruLessRecently

Holds the subscript of the virtual file descriptor in the LRU pool that was used less recently than this VFD.

  • fileSize

For a temporary file, this is the current size of the file; for any other file it is 0.

  • fileName

The file name, or NULL for an unused VFD. Note that fileName points to memory allocated with malloc, so it must be released with free when the VFD is closed.

  • fileFlags

The flags passed as the second argument of open() when the file is opened or reopened, for example O_RDONLY (read only), O_WRONLY (write only), or O_RDWR (read and write).

  • fileMode

The permission mode passed to open(), which takes effect when the file does not exist and is created with O_CREAT. It sets the read, write, and execute permissions for the file’s owner, the owning group, and other users. Related permission macros are declared as follows:

#define PG_MODE_MASK_OWNER        (S_IRWXG | S_IRWXO)
/*
 * Mode mask for data directory permissions that also allows group read/execute.
 */
#define PG_MODE_MASK_GROUP      (S_IWGRP | S_IRWXO)

/* Default mode for creating directories */
#define PG_DIR_MODE_OWNER      S_IRWXU

/* Mode for creating directories that allows group read/execute */
#define PG_DIR_MODE_GROUP      (S_IRWXU | S_IRGRP | S_IXGRP)

/* Default mode for creating files */
#define PG_FILE_MODE_OWNER        (S_IRUSR | S_IWUSR)

/* Mode for creating files that allows group read */
#define PG_FILE_MODE_GROUP      (S_IRUSR | S_IWUSR | S_IRGRP)
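
To make the difference between fileFlags and fileMode concrete, here is a small sketch using plain POSIX calls. The helper names reopen_flags() and create_then_reopen() are hypothetical; the masking expression mirrors the adjustment fd.c applies before saving the flags, but the rest is an illustration, not PostgreSQL code.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Drop the flags that only make sense the first time a file is opened. */
static int
reopen_flags(int flags)
{
    return flags & ~(O_CREAT | O_TRUNC | O_EXCL);
}

/*
 * Hypothetical demo: create a file readable and writable by its owner only,
 * close it, then open it again with the adjusted flags.
 * Returns 0 on success and -1 on any failure.
 */
static int
create_then_reopen(const char *path)
{
    int     flags = O_RDWR | O_CREAT | O_TRUNC;  /* plays the role of fileFlags */
    mode_t  mode = S_IRUSR | S_IWUSR;            /* plays the role of fileMode */
    char    c = 0;
    ssize_t n;
    int     fd = open(path, flags, mode);

    if (fd < 0)
        return -1;
    if (write(fd, "x", 1) != 1)
    {
        close(fd);
        return -1;
    }
    close(fd);

    /* Without O_TRUNC, the byte written above survives the second open. */
    fd = open(path, reopen_flags(flags), mode);
    if (fd < 0)
        return -1;
    n = read(fd, &c, 1);
    close(fd);
    unlink(path);
    return (n == 1 && c == 'x') ? 0 : -1;
}
```

If the original flags were reused unchanged, every reopen would truncate the file again, which is why the saved copy has the creation-related bits cleared.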

After introducing the member list of Vfd structure types, let’s focus on how PostgreSQL uses Vfd data structures to implement LRU handle resource pools.

3. LRU virtual file descriptor pool

In PostgreSQL, each backend process (for more on backend processes, read the PostgreSQL Database Architecture) uses an LRU (Least Recently Used) pool to manage all of its open virtual file descriptors (VFDs). Each VFD in the LRU pool maps one-to-one to an open file on disk. Each process has its own private LRU pool and its own VFDs. When a process needs to open a file, it requests a VFD from its own pool, and it releases the VFD, along with the resources tied to it, when the file is no longer needed.

3.1 VfdCache Global array

PostgreSQL manages the LRU pool through a global pointer variable, VfdCache, defined in fd.c as a pointer to Vfd. It points to an array of virtual file descriptors that grows dynamically as needed, and the first element of that array serves as the head of the LRU pool (similar to the head node of a linked list; read Linked Lists of Data Structures for more information).

The VfdCache pointer variable is defined as follows:

static Vfd *VfdCache;
static Size SizeVfdCache = 0;

/*
 * Number of file descriptors known to be in use by VFD entries.
 */
static int  nfile = 0;

There are three important variables: VfdCache, SizeVfdCache, and nfile. VfdCache points to the start of the array (the LRU pool header), SizeVfdCache is the current size of the array, and nfile is the number of kernel file descriptors currently in use by VFD entries.

3.2 VfdCache array initialization

The VfdCache array is initialized by the function InitFileAccess() before the postmaster process comes up. The function allocates the first element and sets its fd member to VFD_CLOSED, indicating that no kernel file descriptor is attached.

Assert(SizeVfdCache == 0);  /* call me only once */

/* initialize cache header entry */
VfdCache = (Vfd *) malloc(sizeof(Vfd));
if (VfdCache == NULL)
  ereport(FATAL,
      (errcode(ERRCODE_OUT_OF_MEMORY),
       errmsg("out of memory")));

MemSet((char *) &(VfdCache[0]), 0, sizeof(Vfd));
VfdCache->fd = VFD_CLOSED;


Note that VfdCache[0] is not a usable VFD; it is the head node of the entire LRU pool. When initialization is complete, VfdCache points to an address in heap memory, as shown below:

Because the global variable SizeVfdCache tracks the size of the VfdCache array, it is now set to 1. Although VfdCache[0] is not a usable VFD, it is the one element the array contains at this point, and it acts as the head node.

{
  ....  // several lines omitted

  SizeVfdCache = 1;

  /* register proc-exit hook to ensure temp files are dropped at exit */
  on_proc_exit(AtProcExit_Files, 0);
}

on_proc_exit() registers a callback function, AtProcExit_Files, to ensure that temporary files are deleted when the process exits. Each time a file is opened, the fdstate member of its Vfd is initialized according to the kind of file being opened, and the exit-time cleanup treats each VFD according to that value. Descriptors that were allocated outside the VFD pool are closed according to their kind, as the following code shows:

switch (desc->kind)
{
  case AllocateDescFile:
    result = fclose(desc->desc.file);
    break;
  case AllocateDescPipe:
    result = pclose(desc->desc.file);
    break;
  case AllocateDescDir:
    result = closedir(desc->desc.dir);
    break;
  case AllocateDescRawFD:
    result = close(desc->desc.fd);
    break;
  default:
    elog(ERROR, "AllocateDesc kind not recognized");
    result = 0;      /* keep compiler quiet */
    break;
}

3.3 LRU pool structure diagram

The LRU pool is a doubly linked list that begins and ends at the element VfdCache[0]. Element 0 is a special node that does not represent a file; its fd field is always equal to VFD_CLOSED. It is just an anchor that marks the start and end of the LRU pool. Only VFD elements that are currently open (that is, have a kernel FD assigned) are in the LRU pool.

Although the LRU pool is a doubly linked list, the Vfd structure contains no pointers. Instead, the integer members lruMoreRecently and lruLessRecently play the role of the next and prev pointers of a doubly linked list.

Each VFD in the LRU pool is linked to its two neighbours by array subscript: the lruMoreRecently member holds the subscript of the VFD that was used more recently, and the lruLessRecently member holds the subscript of the VFD that was used less recently, as shown in the figure below:

VfdCache[0] acts as the head node of the ring (a special VFD). The most recently used VFD links back to the VfdCache[0] header through its lruMoreRecently member, and the header's lruLessRecently member points at that VFD; the header's lruMoreRecently member, in turn, points at the least recently used VFD, VfdCache[n]. This makes it easy to find the least recently used VFD in the pool, starting from the VfdCache[0] header node.
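
The ring described above can be sketched in a few lines. This is a reduced illustration, not the fd.c code: RingVfd keeps only the two link members, the array has a fixed size, and the function names ring_insert(), ring_delete(), ring_touch(), and ring_least_recent() are hypothetical. The link updates in ring_insert() follow the same order as the insertion code quoted later in this article.

```c
#include <assert.h>

typedef int File;

/* Reduced stand-in for Vfd: only the members the ring needs. */
typedef struct
{
    File lruMoreRecently;
    File lruLessRecently;
} RingVfd;

/* Zero-initialized, so slot 0 (the header) initially links to itself. */
static RingVfd ring[8];

/* Link a slot into the ring as the most recently used entry. */
static void
ring_insert(File file)
{
    RingVfd *v = &ring[file];

    v->lruMoreRecently = 0;
    v->lruLessRecently = ring[0].lruLessRecently;
    ring[0].lruLessRecently = file;
    ring[v->lruLessRecently].lruMoreRecently = file;
}

/* Unlink a slot from the ring by joining its two neighbours. */
static void
ring_delete(File file)
{
    RingVfd *v = &ring[file];

    ring[v->lruLessRecently].lruMoreRecently = v->lruMoreRecently;
    ring[v->lruMoreRecently].lruLessRecently = v->lruLessRecently;
}

/* Mark a slot as just used: move it to the most-recent end. */
static void
ring_touch(File file)
{
    ring_delete(file);
    ring_insert(file);
}

/* Subscript of the least recently used slot, or 0 if the ring is empty. */
static File
ring_least_recent(void)
{
    return ring[0].lruMoreRecently;
}
```

Using subscripts instead of pointers means the links stay valid when the array is moved by realloc, which matters because the array grows on demand.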

Of course, the size of this LRU pool is still limited by the number of file descriptors the operating system allows the process to open. In PostgreSQL, this is closely related to the value of the max_safe_fds variable.
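
How max_safe_fds is computed is not shown in this article. As an illustration of the operating-system side only, the following sketch queries the per-process soft limit on open descriptors with the POSIX getrlimit() call. The helper name soft_fd_limit() is hypothetical, and this is not presented as PostgreSQL's own method.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: report the soft limit on open file descriptors for
 * the calling process. Returns 0 on success and stores the limit in *limit
 * (-1 is stored if the limit is unlimited); returns -1 if getrlimit() fails.
 */
static int
soft_fd_limit(long *limit)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *limit = (rl.rlim_cur == RLIM_INFINITY) ? -1L : (long) rl.rlim_cur;
    return 0;
}
```

A budget for VFD-backed descriptors has to stay below this value, since the process also needs descriptors for sockets, pipes, and library calls that do not go through the VFD layer.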

3.3.1 Obtaining the VFD from the LRU pool

As mentioned in Section 3.2, when the postmaster comes up, it allocates space for a single Vfd and stores its address in VfdCache. At this point, however, no usable virtual file handle exists: as mentioned earlier, VfdCache[0] only serves as the head node of the doubly linked list and never stores a valid VFD. Therefore, on the first attempt to obtain a VFD, the process goes through the AllocateVfd() function to allocate usable VFD entries.

VfdCache allocates VFDs in batches, doubling the array each time, with a minimum array size of 32. For example, after VfdCache has been initialized, SizeVfdCache is 1; this variable records the number of Vfd entries in the array. When AllocateVfd() is called for the first time, the doubled size SizeVfdCache * 2 = 2 is smaller than 32, so the array is grown to 32 entries. The request is illustrated below:

The corresponding code reference is as follows:

Size  newCacheSize = SizeVfdCache * 2;  // SizeVfdCache is 1 after InitFileAccess
Vfd  *newVfdCache;

if (newCacheSize < 32)
  newCacheSize = 32;

/*
 * Be careful not to clobber VfdCache ptr if realloc fails.
 */
newVfdCache = (Vfd *) realloc(VfdCache, sizeof(Vfd) * newCacheSize);
if (newVfdCache == NULL)
  ereport(ERROR,
      (errcode(ERRCODE_OUT_OF_MEMORY),
       errmsg("out of memory")));
VfdCache = newVfdCache;

Once the memory has been allocated, the nextFree member of each new Vfd is initialized so that each entry points to the next one, forming a free list. Since SizeVfdCache is equal to 1, initialization starts with VfdCache[1].

for (i = SizeVfdCache; i < newCacheSize; i++)
{
  MemSet((char *) &(VfdCache[i]), 0, sizeof(Vfd));
  VfdCache[i].nextFree = i + 1;  // 31 next --> 32
  VfdCache[i].fd = VFD_CLOSED;
}
VfdCache[newCacheSize - 1].nextFree = 0;
VfdCache[0].nextFree = SizeVfdCache;

/*
 * Record the new size
 */
SizeVfdCache = newCacheSize;  // 1, 32, 64, 128, 256, ...

After the nextFree members are initialized, the global variable SizeVfdCache is set to the new number of entries in the array (32, 64, 128, 256, 512, and so on). The first entry of the free list is then taken off the list:

file = VfdCache[0].nextFree;
VfdCache[0].nextFree = VfdCache[file].nextFree;

return file;

AllocateVfd() then returns the available VFD, where file is the subscript of that VFD in the VfdCache array. Because VFDs are handed out in order from the free list headed by VfdCache[0].nextFree, the first requests return the file values 1, 2, 3, 4, 5, 6, and so on. When the entries obtained by the 32-entry request are used up, the array is grown to 64. The first request obtains VfdCache[1], as shown in the figure below.
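
The allocation scheme just described (a free list threaded through nextFree, with the array doubling when the list is empty) can be sketched as follows. SlotVfd, cache_init(), slot_alloc(), and slot_free() are hypothetical names for this illustration, and errors are reported through return values rather than ereport().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int File;

/* Reduced stand-in for Vfd: only the members the free list needs. */
typedef struct
{
    int  fd;
    File nextFree;
} SlotVfd;

static SlotVfd *cache = NULL;
static size_t   cacheSize = 0;

/* Allocate the header slot. Returns 0 on success, -1 if out of memory. */
static int
cache_init(void)
{
    cache = calloc(1, sizeof(SlotVfd));
    if (cache == NULL)
        return -1;
    cache[0].fd = -1;
    cacheSize = 1;
    return 0;
}

/* Pop a free slot, doubling the array (minimum 32) when none is left. */
static File
slot_alloc(void)
{
    File file;

    if (cache[0].nextFree == 0)
    {
        size_t   newSize = cacheSize * 2;
        SlotVfd *newCache;

        if (newSize < 32)
            newSize = 32;
        newCache = realloc(cache, sizeof(SlotVfd) * newSize);
        if (newCache == NULL)
            return -1;          /* the old array is still intact */
        cache = newCache;

        /* Chain the new slots together; the last one ends the list. */
        for (size_t i = cacheSize; i < newSize; i++)
        {
            memset(&cache[i], 0, sizeof(SlotVfd));
            cache[i].nextFree = (File) (i + 1);
            cache[i].fd = -1;
        }
        cache[newSize - 1].nextFree = 0;
        cache[0].nextFree = (File) cacheSize;
        cacheSize = newSize;
    }
    file = cache[0].nextFree;
    cache[0].nextFree = cache[file].nextFree;
    return file;
}

/* Push a slot back onto the front of the free list. */
static void
slot_free(File file)
{
    cache[file].fd = -1;
    cache[file].nextFree = cache[0].nextFree;
    cache[0].nextFree = file;
}
```

A freed slot goes to the front of the list, so it is the next one handed out; subscript 0 doubles as the end-of-list marker because slot 0 is never a usable entry.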

Once a VFD array element is available, the system function is called to open the specified file, and the file descriptor returned by open() is stored in the fd member of the VFD. The open flags and the file permission mode (used if the file is created) are stored in the VFD members fileFlags and fileMode respectively, and the other members are initialized as appropriate.

file = AllocateVfd();
vfdP = &VfdCache[file];

/* Close excess kernel FDs. */
ReleaseLruFiles();

vfdP->fd = BasicOpenFilePerm(fileName, fileFlags, fileMode);

if (vfdP->fd < 0)
{
  int save_errno = errno;

  FreeVfd(file);
  free(fnamecopy);
  errno = save_errno;
  return -1;
}
++nfile;
DO_DB(elog(LOG, "PathNameOpenFile: success %d", vfdP->fd));

vfdP->fileName = fnamecopy;
/* Saved flags are adjusted to be OK for re-opening file */
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
vfdP->fileMode = fileMode;
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;

At this point, the VFD is a virtual file descriptor usable by the process. The upper layer is given only the array subscript file of the VFD in VfdCache, not the values of the VFD's members. The final task is to set the two array-index members lruMoreRecently and lruLessRecently of the VFD, linking it into the ring next to the VfdCache[0] header node. This makes it possible to find the most recently and least recently used VFDs quickly from VfdCache[0], and to close the least recently used VFD according to the LRU policy when the number of open files would exceed the operating system's file descriptor limit. The corresponding code is as follows:

vfdP = &VfdCache[file];
vfdP->lruMoreRecently = 0;
vfdP->lruLessRecently = VfdCache[0].lruLessRecently;
VfdCache[0].lruLessRecently = file;
VfdCache[vfdP->lruLessRecently].lruMoreRecently = file;
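
Putting the pieces together, the sketch below shows the effect the LRU ring makes possible: more virtual handles than kernel descriptors. It is a simplified illustration under stated assumptions, not the fd.c implementation. DemoVfd, demo_open(), demo_access(), and the budget MAX_KERNEL_FDS (standing in for max_safe_fds) are hypothetical; the pool has a fixed number of slots instead of a free list; file names are borrowed rather than copied; and only flags without O_CREAT are supported, so no mode argument is needed.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define VFD_CLOSED      (-1)
#define MAX_KERNEL_FDS  2     /* hypothetical stand-in for max_safe_fds */
#define POOL_SLOTS      8

typedef int File;

typedef struct
{
    int         fd;           /* kernel FD, or VFD_CLOSED */
    const char *fileName;     /* borrowed pointer; NULL if slot unused */
    int         fileFlags;    /* flags for (re)opening */
    File        lruMoreRecently;
    File        lruLessRecently;
} DemoVfd;

static DemoVfd pool[POOL_SLOTS];  /* slot 0 is the ring header */
static int     nfile = 0;         /* kernel FDs currently open via the pool */

static void
lru_insert(File f)
{
    pool[f].lruMoreRecently = 0;
    pool[f].lruLessRecently = pool[0].lruLessRecently;
    pool[0].lruLessRecently = f;
    pool[pool[f].lruLessRecently].lruMoreRecently = f;
}

static void
lru_unlink(File f)
{
    pool[pool[f].lruLessRecently].lruMoreRecently = pool[f].lruMoreRecently;
    pool[pool[f].lruMoreRecently].lruLessRecently = pool[f].lruLessRecently;
}

/* Return a kernel FD for slot f, reopening the file if it was evicted. */
static int
demo_access(File f)
{
    if (pool[f].fd != VFD_CLOSED)
    {
        lru_unlink(f);            /* already open: just refresh recency */
        lru_insert(f);
        return pool[f].fd;
    }
    while (nfile >= MAX_KERNEL_FDS)
    {
        File victim = pool[0].lruMoreRecently;  /* least recently used */

        lru_unlink(victim);
        close(pool[victim].fd);
        pool[victim].fd = VFD_CLOSED;
        nfile--;
    }
    pool[f].fd = open(pool[f].fileName, pool[f].fileFlags);
    if (pool[f].fd < 0)
    {
        pool[f].fd = VFD_CLOSED;
        return -1;
    }
    nfile++;
    lru_insert(f);
    return pool[f].fd;
}

/* Register a file in the first unused slot and open it. */
static File
demo_open(const char *path, int flags)
{
    for (File f = 1; f < POOL_SLOTS; f++)
    {
        if (pool[f].fileName != NULL)
            continue;
        pool[f].fd = VFD_CLOSED;
        pool[f].fileName = path;
        pool[f].fileFlags = flags;
        if (demo_access(f) < 0)
        {
            pool[f].fileName = NULL;
            return -1;
        }
        return f;
    }
    return -1;
}
```

With a budget of two kernel descriptors, opening a third file silently closes the least recently used one, and a later demo_access() on the evicted handle reopens it from the saved name and flags. The caller only ever holds the File subscript, so it never notices the swap.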

4. Summary

In this article, we have analyzed, with reference to the source code, how PostgreSQL allocates and manages file descriptors internally. Because operating systems impose hard limits on the number of FDs a process can have open at the same time, and exceeding these limits causes failures at the operating-system level, managing file descriptors is an essential feature of the PostgreSQL database. The VFD mapping layer gives processes the illusion that an unlimited number of FDs is available. The advantage is that processes do not have to worry about the details of handling FDs: they call the family of functions encapsulated in fd.c, which shields them from the error handling and decision logic around the system calls.