I. Overview

I/O has always been one of the main focuses of performance analysis. The general approach is as follows:

When you know the code well, you can tell exactly which files are read and written frequently. Performance analysts, however, often face a system they did not write, sometimes one built by several teams, and that is when the finger-pointing and arguments start. If the problem can be traced quickly to a specific piece of code and a specific file, communication becomes far more efficient.

On Linux, vmstat or iostat will reveal abnormal disk I/O. They show system-level disk read/write volume and CPU usage, but they cannot point to a process. Once iotop is installed, you can identify the process, but you still don’t know which files that process is working on.

II. Core principles

This article approaches the problem with system-level tools, which makes the method more general. Before that, we need to understand an important property of files: the inode.

What is an inode? Let’s start with a schematic:

The smallest storage unit on a disk is the sector (typically 512 bytes). Eight sectors make up one block (4096 bytes), as the file system reports:

[root@7DGroup2 ~]# tune2fs -l /dev/vda1|grep Block
Block count:              10485504
Block size:               4096
Blocks per group:         32768
[root@7DGroup2 ~]#
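The relationship between sectors, blocks, and total file system size can be checked with a little shell arithmetic. This is only a sketch: the block count and block size are copied from the tune2fs output above, while the 512-byte sector size is an assumption (common, but not universal).

```shell
# Values copied from the tune2fs output above.
block_count=10485504
block_size=4096
# Assumption: 512-byte sectors; some disks use 4096-byte sectors instead.
sector_size=512

# How many sectors fit in one file system block.
sectors_per_block=$((block_size / sector_size))
# Total capacity addressed by the file system, in bytes and in MiB.
fs_bytes=$((block_count * block_size))
fs_mib=$((fs_bytes / 1024 / 1024))

echo "sectors per block: $sectors_per_block"
echo "file system size:  $fs_bytes bytes ($fs_mib MiB)"
```

With these values the result is 8 sectors per block and 40959 MiB, just under 40 GiB.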

Files are stored in these blocks. With many blocks, the layout looks like this:

The red part is the stored file. When we work with the file system through ls or other commands, we operate on paths; those are upper-level commands. When a command runs, the operating system has to locate the file in order to act on it, and it does so through the inode. An inode (index node) stores the file’s metadata, including:

  • File size in bytes
  • User ID
  • Group ID
  • Read, write, and execute permissions
  • Timestamps: ctime is when the inode last changed, mtime is when the file content last changed, and atime is when the file was last accessed
  • Number of links, that is, how many file names point to the inode
  • Location of the file’s data blocks

With this information, the system can operate on the file. The inode itself is stored on disk and takes up some space, shown in green in the figure above.
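To make the inode fields listed above concrete, here is a minimal sketch that creates a file and a hard link in a throwaway directory and reads a few fields back. It assumes GNU coreutils stat (the -c format option); the file names a.txt and b.txt are made up for the illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/a.txt"
# A hard link is a second file name pointing at the same inode.
ln "$workdir/a.txt" "$workdir/b.txt"

# GNU stat format codes: %i inode number, %s size in bytes, %h link count.
ino_a=$(stat -c '%i' "$workdir/a.txt")
ino_b=$(stat -c '%i' "$workdir/b.txt")
size=$(stat -c '%s' "$workdir/a.txt")
links=$(stat -c '%h' "$workdir/a.txt")

echo "a.txt inode=$ino_a  b.txt inode=$ino_b  size=$size  links=$links"
rm -r "$workdir"
```

Both names report the same inode number and a link count of 2, which is why an inode, not a path, is the stable identity of a file.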

When we see high I/O at the system level, as shown below:

As the figure above shows, almost all of the system’s CPUs are waiting for I/O. What now? Following the approach described earlier, drill down to process-level and thread-level I/O to find the specific files. Let’s implement that.

Here we use SystemTap, a tool mentioned in a previous 7Dgroup article but not covered in depth. If possible, we will write more about the principles and usage of similar tools later.

The SystemTap logic diagram is as follows:

As the diagram shows, SystemTap works at the kernel level rather than the shell level, providing an open door into the kernel. It also ships with example scripts for disk I/O monitoring.

Take iotop.stp as an example. Its source is as follows:

#!/usr/bin/stap
global reads, writes, total_io

# accumulate bytes read, keyed by process name
probe vfs.read.return {
    reads[execname()] += bytes_read
}

# accumulate bytes written, keyed by process name
probe vfs.write.return {
    writes[execname()] += bytes_written
}

# print top 10 IO processes every 5 seconds
probe timer.s(5) {
    foreach (name in writes)
        total_io[name] += writes[name]
    foreach (name in reads)
        total_io[name] += reads[name]

    printf("%16s\t%10s\t%10s\n", "Process", "KB Read", "KB Written")

    # "total_io-" iterates in descending order of total bytes
    foreach (name in total_io- limit 10)
        printf("%16s\t%10d\t%10d\n", name,
               reads[name]/1024, writes[name]/1024)

    # reset the counters for the next interval
    delete reads
    delete writes
    delete total_io
    print("\n")
}

The script prints the top 10 processes by I/O volume every 5 seconds.

This script has two problems:

  1. Processes that share a name but have different PIDs are counted together.

  2. We still don’t know which files the process is operating on.

Both vfs.read and vfs.write expose a local variable ino, the file’s inode number. With it, we can pinpoint the processes and files with the most reads and writes.

$ sudo stap -L 'vfs.{write,read}'
vfs.read  file:long pos:long buf:long bytes_to_read:long dev:long devname:string  ino:long name:string argstr:string $file:struct file* $buf:char*  $count:size_t $pos:loff_t*
vfs.write  file:long pos:long buf:long bytes_to_write:long dev:long devname:string  ino:long name:string argstr:string $file:struct file* $buf:char const*  $count:size_t $pos:loff_t*

The extended script is as follows:

#!/usr/bin/stap
global reads, writes, total_io

# accumulate bytes read, keyed by process name, PID, and inode
probe vfs.read.return {
    reads[execname(), pid(), ino] += bytes_read
}

# accumulate bytes written, keyed by process name, PID, and inode
probe vfs.write.return {
    writes[execname(), pid(), ino] += bytes_written
}

# print top 10 IO processes every 5 seconds
probe timer.s(5) {
    foreach ([name, process, inode] in writes)
        total_io[name, process, inode] += writes[name, process, inode]
    foreach ([name, process, inode] in reads)
        total_io[name, process, inode] += reads[name, process, inode]

    printf("%16s\t%8s\t%8s\t%10s\t%10s\n", "Process", "PID", "inode", "KB Read", "KB Written")

    # "total_io-" iterates in descending order of total bytes
    foreach ([name, process, inode] in total_io- limit 10)
        printf("%16s\t%8d\t%8d\t%10d\t%10d\n", name, process, inode,
               reads[name, process, inode]/1024, writes[name, process, inode]/1024)

    # reset the counters for the next interval
    delete reads
    delete writes
    delete total_io
    print("\n")
}

III. An experiment

As an experiment, run dd to generate heavy disk reads and writes:

dd bs=64k count=4k if=/dev/zero of=test oflag=dsync

dd reads 64 KiB at a time from /dev/zero and writes it to the file test in the current directory, 4096 times in total. With oflag=dsync each write is synchronous, so every block has to reach the disk before the next one is written. On Linux, /dev/zero is a special file that supplies an endless stream of null characters (ASCII NUL, 0x00) when read.
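As a quick sanity check on how much data this dd invocation moves, the arithmetic can be sketched as follows. It assumes GNU dd, where the k suffix means 1024.

```shell
# GNU dd: the "k" suffix multiplies by 1024.
bs=$((64 * 1024))       # bs=64k    -> bytes per block
count=$((4 * 1024))     # count=4k  -> number of blocks
total_bytes=$((bs * count))
total_mib=$((total_bytes / 1024 / 1024))

echo "dd transfers $total_bytes bytes ($total_mib MiB) in $count synchronous writes"
```

That is 256 MiB read from /dev/zero and 256 MiB written to test, so the KB Read and KB Written columns for dd should be of similar size.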

The iotop.stp monitoring results are as follows:

The monitoring shows that the dd process with PID 2978 reads the file with inode 1047 and writes the file with inode 663624, and these are the two heaviest read and write operations. An inode number alone does not tell us which file it is; find / -inum 1047 locates the file with that inode number. The stat command then shows the details of the file’s inode.

$ stat /dev/zero
  File: "/dev/zero"
  Size: 0          Blocks: 0          IO Block: 4096   character special file
Device: 5h/5d   Inode: 1047        Links: 1     Device type: 1,5
Access: (0666/crw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:zero_device_t:s0
Access: 2017-05-02 10:50:03.242425632 +0800
Modify: 2017-05-02 10:50:03.242425632 +0800
Change: 2017-05-02 10:50:03.242425632 +0800
 Birth: -
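The inode-to-path lookup can be tried safely on a scratch directory before running it against /. The sketch below assumes GNU stat and a find that supports -inum; the file names target.log and noise.log are made up for the illustration.

```shell
# Build a small scratch tree with two files.
workdir=$(mktemp -d)
printf 'data\n'  > "$workdir/target.log"
printf 'other\n' > "$workdir/noise.log"

# Stand-in for an inode number taken from the SystemTap output.
ino=$(stat -c '%i' "$workdir/target.log")

# -xdev keeps find on one file system: inode numbers are only unique
# within a single file system, so a search across mounts can match
# unrelated files.
found=$(find "$workdir" -xdev -inum "$ino")

echo "inode $ino -> $found"
rm -r "$workdir"
```

When searching from /, it helps to check the dev or devname value of the probe as well, since the same inode number can exist on several mounted file systems.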

This analysis approach applies to any system, but the tools differ from system to system. The environment used here is CentOS; on other systems you will need to find the equivalent tools.

IV. Summary

Again, understanding the principles and thinking clearly are the heart of performance analysis. Tools only validate ideas. Don’t lose sight of the essentials.