System performance is always a hot topic. I have been doing performance tuning for several years, and this article is a summary of my work.

The first step in tuning is asking why you are tuning at all. That means system analysis, and analysis needs indicators: do proper performance monitoring first and confirm the system really needs tuning. You can’t “tune” for tuning’s sake. That’s not tuning. That’s breaking.

Purpose of performance analysis

  1. Identify system performance bottlenecks
  2. Provide a plan or reference for future optimization
  3. Make good use of resources: both hardware resources and software configuration

Factors affecting performance

To determine what those factors are, first determine what type of application you have. For example:

  1. CPU-intensive

For example, web servers and applications like Nginx and Node.js that use the CPU for batch processing and computation are of this type.

  2. I/O-intensive

For example, MySQL, a typical database, consumes a large amount of memory and storage bandwidth but has low requirements on CPU and network. This kind of application uses the CPU to initiate I/O requests and then goes to sleep.

Once the application type is identified, start analyzing what conditions can affect performance:

  1. A large number of web requests that fill up the run queues and cause heavy context switching and interrupts
  2. A large number of disk requests
  3. Heavy network traffic for the NIC to process
  4. Running out of memory

It all boils down to four things

  1. CPU
  2. memory
  3. I/O
  4. network

Tools for system inspection

We know that these four chunks affect our performance, so what tools do we have to detect them?

The picture above (not reproduced here) is a summary chart by a well-known foreign expert.

Personally, I often use htop, vmstat, iotop, sar, strace, iftop, ss, lsof, ethtool, mtr, and so on.

Alibaba’s tsar and Glances are also recommended for system performance monitoring.

CPU performance monitoring and tuning

We can optimize CPU performance by checking CPU usage and by using tools to watch context switches, interrupts, and the calls the code makes.

To clarify a few terms first. Cache: the CPU provides hardware-level caching to improve memory access performance. To view the caches, run the lscpu command:

# lscpu 
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K

The level 1 cache is a static cache, split into a data cache and an instruction cache. The level 2 and level 3 caches are dynamic caches; the level 3 cache is shared between cores.

To improve the CPU cache hit ratio, we often bind a process to a particular core, which is called “CPU affinity”. On Linux, the taskset command does this:

# taskset -pc 0 73890
pid 73890's current affinity list: 0
pid 73890's new affinity list: 0

But there are problems with that: for example, binding alone does not guarantee local memory allocation, which is where NUMA comes in.

NUMA: Non-Uniform Memory Access. Each physical CPU has its own memory controller and its own segment of memory serving as its local node; the nearest memory node is the local node, and memory attached to other CPUs forms remote nodes, which are slower to reach.



The diagram above (not reproduced here) shows a simple NUMA topology, sourced from the Internet.

numactl binds programs to specific NUMA nodes. Its current policy can be shown with numactl --show:

# numactl --show
policy: default
preferred node: current
physcpubind: 0
nodebind: 0
membind: 0

Note: do not use NUMA on database servers. If you must, start the database with numactl --interleave=all; otherwise uneven allocation across nodes can make life miserable for the operations team.
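For illustration, a hedged sketch of the numactl invocations discussed above (the program names and node numbers are placeholders):

# Run a program with CPU and memory both bound to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./my_app

# Interleave memory across all nodes, as suggested above for databases
numactl --interleave=all ./start_database.sh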

CPU Scheduling Policy

  1. Real-time scheduling policies
  • SCHED_FIFO: a static scheduling policy; once a task gets the CPU, it runs until a higher-priority task arrives or it gives up the CPU itself.
  • SCHED_RR: a round-robin policy; when a process uses up its time slice, it is given a new one and placed at the end of the ready queue. Putting RR tasks at the end of the queue ensures fair scheduling among RR tasks of the same priority. Real-time priorities range from 1 to 99; a larger number means a higher priority.
  2. Normal scheduling policies
  • SCHED_OTHER: the default policy; priority is determined by the nice and counter values. The smaller the nice value, the larger the counter, so the process that has used the least CPU tends to be scheduled first. Priorities range from 100 to 139; a smaller number means a higher priority.
  • SCHED_BATCH
  • SCHED_IDLE

chrt changes a task’s real-time policy and priority (its default policy is SCHED_RR). Do not change this casually in production.

SCHED_OTHER is adjusted with nice and renice. The kernel also adjusts it dynamically, and you can change the nice value manually, as in the sketch below.
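A minimal sketch of these commands (PID 12345 and the script name are placeholders):

# Show the current scheduling policy and priority of a process
chrt -p 12345

# Set SCHED_RR with real-time priority 10 (avoid doing this casually in production)
chrt -r -p 10 12345

# Start a batch job at a lower SCHED_OTHER priority (nice 10)
nice -n 10 ./batch_job.sh

# Lower the priority of an already-running process
renice +5 -p 12345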

Context switches

The Linux kernel treats each core as a separate processor. A kernel can run 50 to 50,000 processes concurrently. Each thread is allocated a time slice and does not give up the CPU until the slice runs out or it is preempted by a higher-priority thread; it is then put back on the run queue. Switching between threads is a context switch, and the more context switches there are, the more work the kernel scheduler has to do.

vmstat shows the context-switch rate in its cs column; a per-process view is sketched below.
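For example (pidstat comes from the sysstat package; the interval and count are illustrative):

# System-wide context switches per second: watch the cs column
vmstat 1 5

# Per-process voluntary (cswch/s) and involuntary (nvcswch/s) context switches
pidstat -w 1 5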

Run queue

Each CPU has a run queue. A thread is either sleeping (blocked, waiting on I/O) or runnable. The longer the run queue, the longer threads wait for the CPU. The run-queue figure reported by tools is the total across all CPUs.

Load describes the run queue: its value is the number of threads currently executing plus the number waiting in the run queue.

For example, if the system has 2 cores, 2 threads currently executing, and 4 threads in the run queue, then load = 2 + 4 = 6.

vmstat, w, and uptime all report run-queue and load figures.
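For instance (the numbers here are illustrative):

# uptime
 14:02:01 up 10 days,  3:04,  2 users,  load average: 1.05, 0.70, 0.52

The three values are load averages over the last 1, 5, and 15 minutes.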

CPU Performance Monitoring

With all that said, what values should you actually watch day to day? First of all, NUMA and scheduler tuning are for special cases; in general they are left alone and only adjusted for specific business scenarios such as virtualization and cloud computing.

Then the performance points we need to observe daily are:

  1. CPU utilization
  • us 60%-70%
  • sy 30%-35%
  • id 0%-5%
  2. cs (context switches)
  • cs correlates with CPU utilization; a lot of switching is acceptable as long as utilization stays in the ranges above
  3. Run queue
  • 4 or fewer is best

Example:

# vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b    swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 3  0 1150840 271628 260684 5530984   0    0     2     1    0     0 22  4 73  0  0
 5  0 1150840 270264 260684 5531032   0    0     0     0 5873  6085 13 13 73  0  0
 5  0 1150840 263940 260684 5531040   0    0     0     4 6721  7507 15 13 72  0  0
 4  0 1150840 263320 260684 5531068   0    0     0     0 6111  7117 10 13 76  0  0
 4  0 1150840 262328 260684 5531072   0    0     0     0 6854  7673 18 13 68  0  0

In the example, both interrupts (in) and context switches (cs) are high, indicating that the kernel is busy switching processes back and forth; the high in value shows the CPU is constantly servicing requests.

Memory

MMU: the CPU does not talk to the hard disk directly; data can only be used by the CPU once it has been loaded into memory. When the CPU accesses memory, the request first goes through a unit that controls and allocates memory read and write requests. This unit is the MMU (Memory Management Unit).

Mapping linear addresses to physical addresses byte by byte would require an enormous table and be far too complex, so memory is instead divided into pages, usually 4 KB each.
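You can confirm the page size on a given system:

# getconf PAGESIZE
4096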

If the page table had to be walked in full on every memory access, it would be slow, so recent translations are cached in the TLB. TLB misses still force a page-table walk, so page tables are organized hierarchically: a level-1 directory, a level-2 directory, and an offset.
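As a concrete illustration of that split, a classic 32-bit x86 address with 4 KB pages breaks down as:

 31         22 21         12 11          0
+-------------+-------------+-------------+
| directory   | page table  | offset      |
| (10 bits)   | (10 bits)   | (12 bits)   |
+-------------+-------------+-------------+

1024 directory entries, each pointing to a page table of 1024 entries, with a 4096-byte offset within the page.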

In addition, there are two ways for the system to manage large amounts of memory:

  1. Increase the number of page tables in the hardware memory management unit
  2. Increase the page size

The first method is not realistic, so we consider the second: huge pages. A huge-page frame is 4 MB on a 32-bit system and 2 MB on a 64-bit system. The larger the page frame, the more memory can be wasted inside it. To view the system’s huge pages:

cat /proc/meminfo

AnonHugePages:    309248 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        6144 kB
DirectMap2M:     1042432 kB
DirectMap1G:           0 kB

AnonHugePages are transparent huge pages (THP), an abstraction layer that automatically creates, manages, and uses huge pages for most purposes. Classic huge pages, by contrast, must be reserved at boot time. To set the number of huge pages manually: sysctl vm.nr_hugepages=20

During a DMA transfer, the DMA controller takes direct control of the bus, which raises the question of handing bus control back and forth: before the transfer, the CPU hands bus control to the DMA controller, and as soon as the transfer ends, the DMA controller hands it back. A complete DMA transfer goes through four steps: DMA request, DMA response, DMA transfer, and DMA end.

Virtual memory: on a 32-bit system, each process sees 4 GB of memory available to it; this is virtual memory (virtual addressing). Translation from virtual to physical memory is done by the MMU. In production we try not to fall back on swap.

Several memory parameters that affect system performance:

  1. overcommit_memory: memory overcommit
  • 0 (default): the kernel decides heuristically whether to allow overcommit.
  • 1: always allow overcommit.
  • 2: allow overcommit only up to a limit governed by overcommit_ratio (default 50). For example, with 8 GB of physical memory and 4 GB of swap, the commit limit is 4 GB + 8 GB x 50% = 8 GB. Note: avoid overcommit in production; for example, turn it off for Redis. (A combined sketch follows this list.)
  2. swappiness: how readily inactive pages are swapped out. Note: try not to use swap. In production, set: echo 10 > /proc/sys/vm/swappiness
  3. drop_caches: reclaiming cache memory
  • Writing 1, 2, or 3 makes the kernel drop various combinations of page cache and slab cache: 1 frees the page cache (buffers), 2 frees unused slab objects, 3 frees both. In production, run sync first, then:

    echo 3 > /proc/sys/vm/drop_caches
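Putting these together, a sketch of applying the settings above (the values are illustrative; persist them in /etc/sysctl.conf to survive reboots):

# Swap reluctantly
sysctl vm.swappiness=10

# Strict overcommit accounting: limit = swap + RAM * overcommit_ratio / 100
sysctl vm.overcommit_memory=2
sysctl vm.overcommit_ratio=50

# Flush dirty pages, then drop page cache and slab caches
sync
echo 3 > /proc/sys/vm/drop_caches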

I/O

The IO subsystem is generally the slowest part of a Linux system. One reason is its distance from the CPU, and the other reason is its physical structure. Therefore, minimize disk IO.

Disk scheduling policy:

# cat /sys/block/sda/queue/scheduler 
noop anticipatory deadline [cfq]

The scheduler shown in brackets, CFQ, is the one currently in use:

  • CFQ (completely fair queuing): within its time slice, a process may have up to eight requests in flight at a time (by default). The scheduler uses historical data to estimate whether a program will issue more I/O in the near future; if so, CFQ sits idle waiting for that I/O, even though other processes have I/O pending.
  • deadline: every request must be serviced within a specified deadline.
  • noop: no reordering policy, a simple FIFO.
  • anticipatory: used on systems that write a lot and read little.
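Switching schedulers at runtime is a one-line write (sda is a placeholder device name; the schedulers available depend on the kernel):

# echo deadline > /sys/block/sda/queue/scheduler
# cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq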

The Linux kernel performs disk I/O in pages, typically 4K. To check the page size: /usr/bin/time -v date

MPF: Linux maps physical memory into a process’s virtual address space, and the kernel maps only the pages that are actually needed. When an application starts, the kernel searches the CPU cache and then physical memory for the required page; if it is not there, the kernel raises a major page fault (MPF) to read the data from disk and cache it in memory.

If the required page is found in the buffer cache, a minor page fault (MnPF) is raised instead.

Run /usr/bin/time -v on a program: the first run will show mostly major page faults, while the second run will show mostly minor ones, since the pages are already cached.
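A sketch of what that looks like (the program name and the fault counts are illustrative):

# /usr/bin/time -v ./helloWorld 2>&1 | grep 'page faults'
Major (requiring I/O) page faults: 127
Minor (reclaiming a frame) page faults: 312

# Second run: the pages are already in the buffer cache
# /usr/bin/time -v ./helloWorld 2>&1 | grep 'page faults'
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 315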

The File Buffer Cache

The file buffer cache reduces MPFs and increases MnPFs. It keeps growing until free memory runs low or the kernel needs memory for other applications, so low free memory does not mean the system is short of memory; it means Linux is making full use of memory as cache.

# cat /proc/meminfo 
MemTotal:        1004772 kB
MemFree:           79104 kB
Buffers:          105712 kB

Dirty data pages can be written back to disk immediately with fsync() or sync(). If those functions are not called, the pdflush daemon periodically flushes dirty pages back to disk.
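A quick way to watch that happening (the field names come from /proc/meminfo):

# Flush dirty pages to disk now
sync
# How much dirty data is waiting, and how much is currently being written back
grep -E 'Dirty|Writeback' /proc/meminfo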

iotop shows per-process I/O usage; lsof lists all open files and the processes using them.

Other useful commands include vmstat, sar, iostat, top, and htop.

The original: www.cnblogs.com/iteemo/p/56…