Preface

The following is an introductory overview, intended as a broad summary of older technologies rather than an in-depth study. For a deeper treatment, see the book Building High-performance Web Sites.

What is server concurrency capability

The more requests a server can process per unit of time, the higher its capacity, that is, the higher its concurrent processing capability.

Is there any way to measure server concurrency

1. The throughput rate

Throughput rate: the maximum number of requests the server can process per unit of time, expressed in req/s.

From the server's perspective, the actual number of concurrent users can be understood as the total number of file descriptors the server maintains for different users, that is, the number of concurrent connections.

Servers typically limit the maximum number of users that can be served simultaneously, such as Apache's MaxClients directive.

The server wants to sustain a high throughput rate, while each user wants to wait as little as possible. These two goals cannot both be fully satisfied, so the maximum number of concurrent users we aim for is the point that balances the interests of both sides.

2. Stress test

One principle must be made clear first: if 100 users each send 10 requests to the server at the same time, and 1 user sends 1000 consecutive requests to the server, is the pressure on the server the same?

Actually, it is different, because for a single user, sending requests "continuously" really means sending one request and receiving its response before sending the next.

So when one user sends 1000 requests in a row, at any moment the server's NIC has at most one request in its receive buffer; with 100 users each sending 10 requests at the same time, the NIC's receive buffer can hold up to 100 requests waiting to be processed. Clearly, the pressure on the server is greater in the second case.

Conditions to be considered before the stress test

  • Concurrent users: The total number of concurrent users sending requests to the server at any one time (HttpWatch)
  • The total number of requests
  • Requested Resource Description
  • Request wait time (user wait time)
  • Average waiting time for user requests
  • Average server request processing time
  • Hardware environment

The times of interest in a stress test fall into the following two types:

  • Average user request waiting time (data transmission time on the network and computing time on the user’s PC are not taken into account)
  • Average server request processing time

Average user request waiting time measures the quality of service experienced by a single user under a given number of concurrent users. Average server request processing time is the reciprocal of the throughput rate.

In general: average user request waiting time = average server request processing time × number of concurrent users.
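As a rough illustration with made-up numbers: if the server's average request processing time is 10 ms and there are 100 concurrent users, this rule of thumb gives an average user waiting time of about 10 ms × 100 = 1 s.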

How to improve the concurrent processing capacity of the server

1. Improve the concurrent computing capability of the CPU

The server can process multiple requests at the same time because the operating system adopts a multiple-execution-flow design, so that multiple tasks can take turns using system resources.

These resources include CPU, memory, and I/O. I/O mainly refers to disk I/O and network I/O.

Multi-process & multi-thread

The most common implementation of multiple execution flows is the process. The benefit of multiple processes is that CPU time can be rotated among them, overlapping CPU computation with I/O operations. The I/O here refers mainly to disk I/O and network I/O, which are painfully slow compared with the CPU.

In reality, most processes spend most of their time on I/O operations.

DMA technology in modern computers allows the CPU not to participate in the whole course of an I/O operation. For example, a process issues instructions to an I/O device such as a network card or disk through a system call, then is suspended and releases its CPU resources.

This matters especially when contrasted with single-tasking, where the CPU would simply sit idle for most of that time.

Multiple processes do more than improve CPU concurrency. Their separate memory address spaces and life cycles also bring stability and robustness: the crash of one process does not affect the others.

However, processes also have the following disadvantages:

  • fork() system calls are expensive (mitigated by prefork)
  • inter-process scheduling and context switching have a cost (mitigated by reducing the number of processes)
  • a large number of processes duplicates a lot of memory (mitigated by shared memory)
  • IPC programming is relatively cumbersome
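As a minimal illustration of the prefork idea mentioned above (not Apache's actual implementation; port, worker count, and response are made up, and error handling is trimmed), the sketch below forks a fixed number of worker processes up front, each accepting connections from a shared listening socket, so no fork() is needed per request:

```c
/* Minimal prefork sketch: a master process creates a listening socket,
 * forks N workers, and each worker loops on accept() for that socket. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_WORKERS 4   /* hypothetical worker count */

static void worker_loop(int listen_fd) {
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);   /* blocks until a client connects */
        if (conn < 0)
            continue;
        const char *resp = "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";
        write(conn, resp, strlen(resp));            /* trivial fixed response */
        close(conn);
    }
}

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                    /* hypothetical port */
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 128);

    /* Fork the workers once, up front. */
    for (int i = 0; i < NUM_WORKERS; i++) {
        if (fork() == 0) {                          /* child: serve connections */
            worker_loop(listen_fd);
            _exit(0);
        }
    }
    /* Master simply waits; a real server would monitor and respawn workers. */
    for (;;)
        wait(NULL);
}
```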

Reduce process switching

When hardware contexts are loaded and removed frequently, the time consumed can be significant. You can use the Nmon tool to monitor the number of context switches per second on the server.

To minimize the number of context switches, the simplest thing to do is to reduce the number of processes and design a concurrency strategy using threads as much as possible in conjunction with other I/O models.

You can also consider binding processes to CPUs (CPU affinity) to increase the CPU cache hit ratio: if a process keeps migrating between CPUs, the old CPU's cache becomes useless to it.
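A minimal sketch of pinning a process to one CPU on Linux with sched_setaffinity(); the chosen CPU number is arbitrary for the example:

```c
/* Pin the calling process to CPU 0 so its cache stays warm
 * (Linux-specific; _GNU_SOURCE is needed for the CPU_* macros). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                     /* CPU 0, chosen arbitrarily */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = current process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("process now bound to CPU 0\n");
    return 0;
}
```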

Reduce unnecessary locks

When the server processes a large number of concurrent requests, there is some competition for resources when processing multiple requests. In this case, the “lock” mechanism is generally used to control the occupation of resources.

When one task occupies a resource, we lock it, and other tasks must wait for the lock to be released; this waiting is called lock contention.

Understanding the nature of lock contention, we should try to minimize the competition of concurrent requests for shared resources.

For example, disabling the server access log where acceptable can greatly reduce the time spent waiting on locks, minimizing needless waiting.

It is worth mentioning lock-free programming here, which is supported by the kernel and hardware. It mainly uses atomic operations instead of locks to protect access to shared resources.

With atomic operations, the lock instruction prefix is applied only during the actual write, which prevents other tasks from writing to that memory and avoids data races. Atomic operations are faster than locks, usually more than twice as fast.

For example, writing to files with fopen()/fwrite() in append mode relies on this idea of lock-free programming: it is more complex to implement, but it is fast and has a low probability of deadlock.
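As a small illustration of replacing a lock with an atomic operation (a toy counter, not the file-append case above), the sketch below uses C11 atomics so several threads can increment a shared counter without a mutex:

```c
/* Lock-free counter increment with C11 atomics: the hardware's atomic add
 * (a lock-prefixed instruction on x86) replaces an explicit mutex. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        atomic_fetch_add(&counter, 1);   /* no lock acquisition, no contention wait */
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", (long)atomic_load(&counter));  /* expect 4000000 */
    return 0;
}
```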

Consider process priorities

The process scheduler dynamically adjusts the priority of processes in the run queue; you can observe a process's PR value with top.
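For instance, a process can raise its own nice value so the scheduler favors other work; a minimal sketch using setpriority() (the nice value 5 is arbitrary):

```c
/* Make the current process "nicer" so the scheduler favors other tasks;
 * the change is reflected in the NI/PR columns of top. */
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    if (setpriority(PRIO_PROCESS, getpid(), 5) != 0) {   /* nice value 5 (arbitrary) */
        perror("setpriority");
        return 1;
    }
    printf("nice value is now %d\n", getpriority(PRIO_PROCESS, getpid()));
    return 0;
}
```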

Consider system load

You can check /proc/loadavg at any time, or the load average shown by top.

Consider CPU usage

In addition to user-space and kernel-space CPU utilization, there is also I/O wait, the proportion of time the CPU is idle waiting for I/O operations to complete (see the wa value in top).

2. Consider reducing memory allocation and release

During the working process of the server, a large amount of memory is needed, so the allocation and release of memory is particularly important.

The memory allocated for intermediate temporary variables and the time spent copying data can be reduced by improving data structures and algorithm complexity, and servers also apply their own strategies to improve efficiency.

For example, Apache allocates a large amount of memory as a memory pool at startup. When memory is needed later, it is taken directly from the pool, avoiding the overhead and fragmentation caused by frequent allocation and release.
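A toy bump-pointer memory pool to illustrate the idea (this is not Apache's apr_pool API; all names are made up): one large block is grabbed once, small requests are carved out of it, and everything is released in a single call.

```c
/* Toy memory pool: allocate one big block up front, hand out pieces
 * of it with a bump pointer, and free everything at once. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char  *base;   /* start of the big block */
    size_t size;   /* total capacity */
    size_t used;   /* bytes handed out so far */
} pool_t;

static pool_t *pool_create(size_t size) {
    pool_t *p = malloc(sizeof(*p));
    p->base = malloc(size);
    p->size = size;
    p->used = 0;
    return p;
}

static void *pool_alloc(pool_t *p, size_t n) {
    if (p->used + n > p->size)
        return NULL;               /* out of pool space (a real pool would grow) */
    void *ptr = p->base + p->used;
    p->used += n;
    return ptr;
}

static void pool_destroy(pool_t *p) {
    free(p->base);                 /* one free() releases every allocation */
    free(p);
}

int main(void) {
    pool_t *p = pool_create(4096);
    char *a = pool_alloc(p, 128);  /* per-request temporaries come from the pool */
    char *b = pool_alloc(p, 256);
    printf("allocated %p and %p from the pool\n", (void *)a, (void *)b);
    pool_destroy(p);
    return 0;
}
```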

Another example is Nginx’s use of multiple threads to process requests, which enables multiple threads to share memory resources, thus greatly reducing its overall memory usage.

In addition, Nginx’s phased memory allocation strategy, allocating on demand and releasing in a timely manner, keeps memory usage within a very small range.

In addition, you can consider shared memory.

Shared memory refers to a large region of memory that can be accessed by different CPUs in a multi-processor system or shared by different processes. It is a very fast method of inter-process communication.

But there is a downside to using shared memory, which is that data is not consistent across multiple machines.

The shell command ipcs displays the status of shared memory in the system; shmget() creates or opens a shared memory region; shmat() attaches an existing shared memory segment to the process's address space; shmctl() performs various operations on a shared memory segment; and shmdt() detaches the shared memory.
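A minimal System V shared-memory sketch using the calls just listed (the key and segment size are arbitrary for the example):

```c
/* Create a shared memory segment, attach it, write into it,
 * detach it, and mark it for removal. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    key_t key = 0x1234;                                  /* arbitrary example key */
    int shmid = shmget(key, 4096, IPC_CREAT | 0600);     /* create or open a 4 KB segment */
    if (shmid < 0) { perror("shmget"); return 1; }

    char *mem = shmat(shmid, NULL, 0);                   /* attach into our address space */
    if (mem == (char *)-1) { perror("shmat"); return 1; }

    strcpy(mem, "hello from shared memory");             /* another process attaching the
                                                            same key would see this data */
    printf("%s\n", mem);

    shmdt(mem);                                          /* detach */
    shmctl(shmid, IPC_RMID, NULL);                       /* mark the segment for removal */
    return 0;
}
```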

3. Consider using persistent connections

A persistent connection (long connection) is a common TCP communication pattern in which multiple pieces of data are sent over a single TCP connection that is kept open.

The opposite is the short connection: establish a connection, send one piece of data, disconnect, then establish a connection again to send the next piece, and so on.

Whether to use persistent connections depends entirely on the application characteristics.

From a performance perspective, establishing a TCP connection is not cheap, so reducing the number of connections improves performance. The speedup is especially noticeable for dense requests of small data such as images or web pages.

HTTP long connections require cooperation between the browser and the web server. Browsers generally support long connections, declared in the HTTP request header as follows: Connection: keep-alive

Mainstream Web servers support long connections. For example, in Apache, you can disable long connections with KeepAlive off.

Another key to using long connections efficiently is the long-connection timeout, that is, when should a long connection be closed?

Apache's default is 5 s. If this value is set too long, resources may be held needlessly and a large number of idle processes kept alive, affecting server performance.

4. Improve I/O model

I/O operations can be divided into many types depending on the device, such as memory I/O, network I/O, and disk I/O

Network I/O and disk I/O are much slower, although disk I/O can be accelerated by organizing disks in parallel with RAID arrays, and network I/O can be improved by purchasing dedicated network bandwidth and using high-bandwidth network adapters.

But these I/O operations require kernel system calls, which require the CPU to schedule, forcing the CPU to waste valuable time waiting for slow I/O operations.

We want the CPU to spend as little time as possible scheduling I/O operations. How to make the fast CPU and slow I/O devices cooperate well has been a long-standing topic in modern computing. The essential difference between the various I/O models is how the CPU participates.

The DMA technology

Data transfer between the I/O device and memory is done by the DMA controller. In DMA mode, the CPU simply issues commands to the DMA and lets the DMA controller handle the data transfer, which saves system resources significantly.

Asynchronous I/O

Asynchronous I/O means that after requesting data, the process can continue processing other tasks and then wait for notification that the I/O operation has completed. In this way, the process does not block when reading or writing data.

Asynchronous I/O is non-blocking, and by the time completion is signaled the actual I/O transfer has already finished, which allows CPU processing and I/O operations to overlap well.
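A minimal POSIX AIO sketch with aio_read() from <aio.h>: the read is submitted, the process is free to do other work, and the result is collected later. The file path, buffer size, and polling loop are purely illustrative.

```c
/* Submit an asynchronous read, do "other work", then collect the result.
 * Link with -lrt on older glibc versions. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);      /* any readable file works */
    if (fd < 0) { perror("open"); return 1; }

    char buf[256];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    /* The process is free to compute here while the kernel performs the read. */

    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);                               /* poll for completion; a real program
                                                       would use a signal or thread callback */
    ssize_t n = aio_return(&cb);
    printf("read %zd bytes asynchronously\n", n);
    close(fd);
    return 0;
}
```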

I/O multiplexing

A server must handle a large number of file descriptors at the same time. With the synchronous non-blocking I/O model, to receive data arriving on many TCP connections at once, the receive call must be issued for each socket in turn; every socket has to be polled whether or not it has data to receive.

If most sockets have no data to receive, the process will waste a lot of CPU time checking to see if those sockets have any data to receive.

The advent of I/O readiness notification (I/O multiplexing) provides a high-performance solution for checking the readiness of a large number of file descriptors: it allows a process to monitor all of its file descriptors at once, quickly obtain the ones that are ready, and then perform data access only on those descriptors.

Epoll supports both level triggering and edge triggering. Edge triggering theoretically has higher performance, but the code is more complicated to implement correctly, because any accidentally missed event can cause request-processing errors.

Epoll has two major improvements:

  • Epoll only tells you about ready file descriptors. When epoll_wait() is called, it does not return the actual descriptors but a value representing the number of ready descriptors; you then fetch that many descriptors from an array maintained by epoll. Memory mapping (mmap) is used here, which eliminates the overhead of copying these file descriptors during the system call.
  • Epoll uses event-based readiness notification. Each file descriptor is registered in advance with epoll_ctl(); once a descriptor becomes ready, the kernel records it through a callback mechanism, and the process is notified when it calls epoll_wait().
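A minimal epoll event-loop sketch showing epoll_create1(), epoll_ctl(), and epoll_wait() on a listening socket (error handling trimmed; the port number is arbitrary):

```c
/* Minimal epoll loop: register a listening socket, wait for readiness,
 * and only then call accept()/read() on the descriptors that are ready. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                           /* arbitrary example port */
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 128);

    int epfd = epoll_create1(0);
    struct epoll_event ev, events[MAX_EVENTS];
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);        /* register the descriptor once */

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  /* returns only ready fds */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                ev.events = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) {                              /* peer closed or error: clean up */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                }
                /* ...otherwise process the request data in buf... */
            }
        }
    }
}
```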

For the I/O models, you can refer to my previous article Java NIO.2. For epoll, see my earlier introduction to select, poll, and epoll.

Sendfile

Most of the time, we request static files from the server, such as images, style sheets, etc.

When such a request is processed, the data of the disk file first passes through the kernel buffer, then enters user memory space without any processing, then goes into the kernel buffer corresponding to the NIC, and finally is handed to the NIC for sending.

Linux provides the sendfile() system call, which delivers a specific part of a disk file directly to the socket descriptor representing the client, speeding up requests for static files while reducing CPU and memory overhead.
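A sketch of this zero-copy path with sendfile() on Linux; the helper name is made up, and conn_fd is assumed to be an already-accepted client socket:

```c
/* Send a whole disk file to a connected socket without copying it
 * through user space. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns bytes sent, or -1 on error. */
ssize_t send_static_file(int conn_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    fstat(file_fd, &st);                      /* need the file size for the count */

    off_t offset = 0;
    ssize_t sent = sendfile(conn_fd, file_fd, &offset, st.st_size);
    /* data flows kernel buffer -> socket buffer; user space never touches it */

    close(file_fd);
    return sent;
}
```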

Application scenario: for requests of small static files, sendfile() helps little, because sending the data accounts for a much smaller fraction of the whole request than it does for large files.

The memory mapping

The Linux kernel provides a special way to access disk files by associating a block of memory address space with a specified disk file, thus converting access to that block of memory into access to disk files. This technique is called memory mapping.

In most cases, memory mapping can improve disk I/O performance: instead of calling read() or write() to access the file, you use the mmap() system call to associate memory with the disk file and then access the file as freely as if it were memory.
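A minimal mmap() sketch: map a file read-only and access it through a pointer instead of read() (the file path is arbitrary):

```c
/* Map a file into memory and read it through a pointer,
 * with no read() system calls for the data itself. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);   /* any readable, non-empty file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    fwrite(data, 1, st.st_size, stdout);        /* the file contents, via memory access */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```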

Disadvantages: When dealing with large files, memory mapping can cause a large memory overhead, which is not worth the cost.

Direct I/O

In Linux 2.6, there is no essential difference between memory mapping and direct file access through read()/write(), because in both cases the data is copied twice: between the disk and the kernel buffer, and between the kernel buffer and user-space memory.

The kernel buffer was introduced to improve the access performance of disk files. But some complex applications, such as database servers, want to bypass the kernel buffer to push performance further, implementing and managing I/O buffers themselves in user space; a database, for example, can apply smarter strategies to improve its query-cache hit ratio.

On the other hand, bypassing the kernel buffer can also reduce the overhead of system memory, since the kernel buffer itself uses system memory.

Linux adds the O_DIRECT option to the open() system call for this: it bypasses the kernel buffer and accesses the file directly, achieving direct I/O.
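A sketch of direct I/O with O_DIRECT (Linux-specific). Buffers, offsets, and lengths must be suitably aligned, so posix_memalign() is used; the file path and the 4096-byte size are illustrative assumptions.

```c
/* Read a file while bypassing the kernel page cache.
 * O_DIRECT requires aligned buffers, offsets, and lengths. */
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/data.bin", O_RDONLY | O_DIRECT);   /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    /* 4096-byte alignment and length satisfy typical block-size requirements */
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

    ssize_t n = read(fd, buf, 4096);    /* data comes straight from disk, no page cache */
    printf("read %zd bytes with O_DIRECT\n", n);

    free(buf);
    close(fd);
    return 0;
}
```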

In MySQL, the InnoDB storage engine manages its own caches for data and indexes; in my.cnf it can be configured to use a raw partition, skipping the kernel buffer and achieving direct I/O.

5. Improve the server concurrency policy

The server concurrency policy aims to ensure that I/O operations and CPU calculations overlap as much as possible. On the one hand, the CPU is not idle during I/O waiting and on the other hand, the CPU spends as little time on I/O scheduling as possible.

One process processes one connection, non-blocking I/O

When multiple concurrent requests arrive at the same time, the server must prepare multiple processes to handle the requests. Its process overhead limits its number of concurrent connections.

However, from the perspective of stability and compatibility, it is relatively safe. The crash of any child process does not affect the server itself, and the parent process can create a new child process. A good example of this strategy is Apache’s fork and Prefork patterns.

Apache is fine for sites with low concurrency (such as 150 or less) that also rely on other Apache features.

One thread processes one connection, non-blocking IO

This approach allows multiple connections to be handled by multiple threads within a process, with one thread handling one connection. Apache’s worker mode is a good example of this, enabling it to support more concurrent connections. However, the overall performance of this mode is not as good as that of Prefork, so worker mode is generally not used.

One process handles multiple connections, asynchronous I/O

The underlying prerequisite for one process (or thread) to handle multiple connections simultaneously is the use of I/O multiplexing readiness notification.

In this case, the process that handles multiple connections is called the worker process or the server process. The number of workers can be configured, such as worker_processes 4 in Nginx.

One thread handles multiple connections, asynchronous IO

Even with high performance I/O multiplexing readiness notifications, disk I/O waits are inevitable. A more efficient approach is to use asynchronous IO for disk files, which few Web servers currently support in a meaningful sense.

6. Improve the hardware environment

Another point worth mentioning is the hardware environment. Upgrading the server's hardware configuration is often the most direct and easiest way to improve application performance; this is known as Scale Up. I will not go into it here.