This article was originally written by Chen Gang, an R&D engineer on Xiaomi's information technology team. The original title was “What are we talking about when we talk about high concurrency?” The content has been revised and polished for better presentation.

1. Introduction

Most members of the instant messaging developer community work on real-time communication systems such as IM, message push, customer service, and audio/video chat. Whenever instant messaging technology comes up, the most common topics are high concurrency, high throughput, and large user counts.

Worrying about what happens if the IM product someday passes a million, then ten million users, before a single line of code has been written, is basic professional instinct for most programmers (the product may well die in the market, but when it's time to be “far-sighted”, don't be lazy; otherwise how will you ask the boss for a raise?).

In instant messaging work, high concurrency is unavoidable. But what exactly is high concurrency? If pressed to explain it precisely, many of us would stumble.

2. Series of articles

This is the seventh in a series of articles with the following table of contents:

“High-Performance Network Programming (1): How Many Concurrent TCP Connections Can a Single Server Have?”

“High-Performance Network Programming (2): The Famous C10K Concurrent-Connection Problem of the Last Decade”

“High-Performance Network Programming (3): The Next 10 Years, It's Time to Consider C10M Concurrency”

“High-Performance Network Programming (4): Theoretical Exploration of High-Performance Network Applications, from C10K to C10M”

“High-Performance Network Programming (5): Understanding the I/O Model in High-Performance Network Programming”

“High-Performance Network Programming (6): Understanding the Threading Model in High-Performance Network Programming”

“High-Performance Network Programming (7): What Is High Concurrency?” (this article)

“High-Performance Network Programming Classic: The C10K Problem”

3. What is high concurrency?

High concurrency is one of the key performance indicators of an Internet system architecture. It usually refers to the number of requests the system can handle per unit of time, most commonly measured in queries per second (QPS).
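For example, a service that completes 300,000 requests during a 60-second benchmark is sustaining 300000 / 60 = 5000 QPS (the numbers here are made up purely for illustration).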

So what exactly are we talking about when we talk about high concurrency? What, in the end, is it?

Don't worry, read on….

4. What high concurrency really is

Here is the conclusion:

1) On the surface, high concurrency is the number of requests the system can handle per unit of time;

2) At its core, high concurrency is about squeezing CPU resources efficiently.

**For example:** suppose we build an application called “MD5 enumeration”: each request carries an MD5 hash, and the system brute-forces candidate strings until it finds and returns the original. This application scenario, or business, is CPU-intensive rather than I/O-intensive.
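As a concrete illustration (my own minimal sketch, not code from the original article), such a service boils down to a pure computation loop; note that it performs no I/O at all:

```php
<?php
// Hypothetical sketch of the CPU-bound "MD5 enumeration" service:
// brute-force lowercase strings up to a fixed length until one hashes
// to the target. Every cycle spent here is useful CPU work.
function crackMd5(string $targetHash, int $maxLen = 4): ?string
{
    $stack = [''];                       // depth-first search over candidates
    while ($stack) {
        $candidate = array_pop($stack);
        if ($candidate !== '' && md5($candidate) === $targetHash) {
            return $candidate;           // found the original string
        }
        if (strlen($candidate) < $maxLen) {
            foreach (range('a', 'z') as $c) {
                $stack[] = $candidate . $c;
            }
        }
    }
    return null;                         // nothing in the search space matched
}

echo crackMd5(md5('abc')), "\n";         // prints "abc"
```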

Here the CPU is doing useful computation the whole time, possibly at 100% utilization, so talking about high concurrency adds nothing. (Of course we could raise the concurrency by adding machines, i.e. adding CPUs, but every programmer knows that truism, and there's no point discussing it: there is no high-concurrency problem that can't be solved by adding machines, and if there is, you haven't added enough machines!)

For most Internet applications, the CPU is not and should not be the bottleneck of the system, which spends most of its time waiting for I/O (hard disk/memory/network) read/write operations to complete.

At this point someone might say: when I look at the system monitor, memory and network are fine, but CPU utilization is maxed out. Why is that?

That's a good question. I'll answer it with a practical example later in this article, and re-emphasize the words “squeeze efficiently” above; they run through this entire article!

5. The controlled-variable method

Everything in a system is interconnected: when we talk about high concurrency, every layer of the system has to keep up.

Let’s review a classic C/S HTTP request flow:

Following the numbered steps in the figure above:

1) After DNS resolution, the request reaches the load-balancing cluster;

2) The load balancer dispatches the request to the service layer according to its configured rules; the service layer is our core business layer and may involve RPC calls, MQ calls, and so on;

3) The request passes through the cache layer;

4) Data is finally persisted;

5) The response is returned to the client.

To achieve high concurrency, we need the load-balancing layer, the service layer, the cache layer, and the persistence layer to all be highly available and high-performance.

Even step 5 can be optimized, for example by compressing static files, pushing static files over HTTP/2, or using a CDN; entire books could be written about optimizing each of these layers.

This article focuses on the service layer, the part of the diagram circled in red, and leaves database and cache topics aside.

As we learned in high school, this is called the controlled-variable method: examine one factor while holding the others fixed.

6. Concurrency

6.1 The evolution of network programming models:

Concurrency has always been an important and difficult topic in server-side programming. To raise system concurrency, server programming models have evolved from the early fork-per-connection process, to process pools/thread pools, to epoll event-driven servers (Nginx, and Node.js with its notorious callback style), and then to coroutines.

As the figure above shows, the whole evolution is a process of squeezing more and more useful work out of the CPU.
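To make the event-driven step concrete, here is a minimal sketch (mine, not from the original article) of an echo server in plain PHP using stream_select(), the portable cousin of epoll: one process multiplexes all the connections instead of forking a process or spawning a thread per client.

```php
<?php
// Minimal event-driven echo server: one process, many connections.
$server = stream_socket_server('tcp://0.0.0.0:8080', $errno, $errstr);
stream_set_blocking($server, false);
$clients = [];

while (true) {
    $read   = array_merge([$server], array_values($clients));
    $write  = null;
    $except = null;
    // Block until at least one socket is readable.
    if (stream_select($read, $write, $except, null) === false) {
        break;
    }
    foreach ($read as $sock) {
        if ($sock === $server) {
            // New client: accept it and start watching it.
            $conn = stream_socket_accept($server);
            if ($conn !== false) {
                $clients[(int) $conn] = $conn;
            }
        } else {
            $data = fread($sock, 8192);
            if ($data === '' || $data === false) {
                // Peer closed: stop watching this socket.
                fclose($sock);
                unset($clients[(int) $sock]);
            } else {
                fwrite($sock, $data); // echo the data back
            }
        }
    }
}
```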

What? Not clear?

6.2 Now, about context switching:

Before discussing context switching, let's first clarify two terms:

1) Parallelism: two events genuinely happen at the same instant (for example, on two different CPU cores);

2) Concurrency: two events alternate within the same time period; viewed at a macro level, both appear to be in progress (for example, two threads interleaving on a single core).

A thread is the smallest unit of operating system scheduling, while a process is the smallest unit of resource allocation. Since a CPU core executes instructions serially, on a single-core CPU only one thread can occupy the CPU at any given moment. Consequently Linux, as a multitasking (multi-process) system, must switch between processes/threads frequently.

Before each task runs, the CPU needs to know where to load it from and where to start executing. This information is kept in the CPU registers and the program counter, and together these are called the CPU context.

Processes are managed and scheduled by the kernel, and a process switch can only happen in kernel mode. Therefore user-space resources such as virtual memory, the stack, and global variables, together with kernel-space state such as the kernel stack and registers, are collectively called the process context.

As mentioned earlier, a thread is the smallest unit of operating system scheduling. At the same time, threads share resources of their parent process, such as virtual memory and global variables, so the parent process's shared resources plus the thread's own private data (such as its stack and registers) are called the thread's context.

For thread context switches, switching between threads of the same process consumes fewer resources than switching between processes, precisely because of that sharing.

Now it is easy to state: switching between processes and threads causes CPU context switches and process/thread context switches, and these switches burn CPU time on work that isn't our business logic.

6.3 What about context switches between coroutines:

So do coroutines need context switches at all? They do, but they produce no CPU context switch and no process/thread context switch, because the switching happens within one thread, that is, entirely in user mode. As a simple mental model, you can think of a switch between coroutines as just moving the program pointer; the CPU still belongs to the current thread the whole time.
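A minimal sketch of this user-mode switching, assuming the Swoole extension is installed (the example is mine, not from the original article):

```php
<?php
// Both coroutines run inside one thread. When coroutine A calls
// Coroutine::sleep(), the scheduler simply resumes coroutine B:
// no kernel-level process/thread context switch is involved.
use Swoole\Coroutine;

go(function () {
    echo "A: start\n";
    Coroutine::sleep(0.5); // yields the thread instead of blocking it
    echo "A: resumed after 0.5s\n";
});

go(function () {
    echo "B: runs while A is suspended\n";
});
```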

For a deeper understanding, take a closer look at the Go GMP model.

The net effect is that coroutines squeeze the CPU for useful work even harder.

7. Back to the original question

At this point someone might say: when I look at the system monitor, memory and network are fine, but CPU utilization is maxed out. Why is that?

In the case this article examines, the CPU utilization is indeed very high, but much of it is inefficient computation.

Take “The best language in the world” for example.

In the typical PHP-FPM CGI mode, every HTTP request will:

1) read the framework's hundreds of PHP files;

2) re-establish and then release all MySQL/Redis/MQ connections;

3) dynamically re-interpret, compile, and execute the PHP files;

4) switch back and forth between different PHP-FPM processes.

This CGI-style way of running PHP is the root cause of its dismal performance under high concurrency. A sketch of the resident-process alternative appears below.
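By contrast, here is a sketch (mine, not from the original article; the DSN and credentials are placeholders) of how a resident Swoole worker sidesteps the per-request costs listed above:

```php
<?php
// Framework files are loaded and the database connection is opened once
// per worker process, then reused for every request.
use Swoole\Http\Server;

$http = new Server('0.0.0.0', 8080);

$http->on('workerStart', function (Server $server, int $workerId) {
    // Runs once per worker process, not once per request.
    // require __DIR__ . '/vendor/autoload.php';  // framework stays resident in memory
    global $db;
    $db = new PDO('mysql:host=127.0.0.1;dbname=test', 'user', 'pass');
});

$http->on('request', function ($request, $response) {
    global $db; // reuse the long-lived connection instead of reconnecting
    $response->end('Hello World');
});

$http->start();
```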

Spotting the problem is often harder than solving it. Once we understand high concurrency, we realize that high concurrency and high performance are limited not by your programming language but by your thinking.

Find the problem, then solve it! So what can we achieve once we squeeze the CPU effectively?

Let's compare the performance of an HTTP service built on PHP + Swoole with one built on Netty, Java's high-performance asynchronous framework.

8. Preparing for the performance comparison

What is Swoole?

Swoole is a high-performance, event-based, asynchronous, coroutine-capable parallel network communication engine for PHP, written in C and C++. Website: www.swoole.com/.

What is Netty?

Netty is a well-known open-source, high-performance network communication framework for Java. It provides an asynchronous, event-driven network application framework and tools for rapidly developing high-performance, reliable network servers and clients. Website: netty.io/; browse the source online at: docs.52im.net/extend/docs… .

What is the maximum number of TCP connections for a single machine?

Recall from computer networking: at the transport layer, every TCP connection is established through a three-way handshake.

Each TCP connection is uniquely identified by four attributes:

1) local IP address;

2) local port;

3) remote IP address;

4) remote port.

The TCP header is as follows:

**Off topic:** if a TCP question comes up in an interview, the authoritative “TCP/IP Illustrated” is where to go…

As shown above:

1) The local port field is 16 bits, so there are 2^16 = 65536 possible local ports, of which at most 65535 are usable (port 0 is reserved);

2) The remote port field is 16 bits, so there are likewise 2^16 = 65536 possible remote ports.

At the same time, in Linux's underlying network programming model, the operating system maintains a file descriptor (fd) for each TCP connection. The fd limit can be checked and changed with the ulimit -n command. Before testing, run ulimit -n 65536 to raise the limit to 65536.

Therefore, regardless of hardware resource constraints:

1) The maximum number of HTTP connections a single machine can initiate is: maximum local ports (65535) × number of local IP addresses (1) = 65535;

2) The maximum number of HTTP connections a single machine can accept is: maximum remote ports (65535) × number of remote (client) IP addresses (effectively unlimited) = unlimited.

**PS:** in practice the operating system reserves some ports for its own use, so the number of local connections is actually lower than the theoretical value. To dig deeper into this question, read the first article in this series: “High-Performance Network Programming (1): How Many Concurrent TCP Connections Can a Single Server Have?”.

9. Performance comparison

9.1 Preparing for the Test

Hardware resources: one Docker container each, with 1 GB of memory and 2 CPU cores, as shown in the figure:

The docker-compose configurations are as follows:

```yaml
# java8
version: "2.2"
services:
  java8:
    container_name: "java8"
    hostname: "java8"
    image: "java:8"
    volumes:
      - /home/cg/MyApp:/MyApp
    ports:
      - "5555:8080"
    environment:
      - TZ=Asia/Shanghai
    working_dir: /MyApp
    cpus: 2
    cpuset: "0,1"
    mem_limit: 1024m
    memswap_limit: 1024m
    mem_reservation: 1024m
    tty: true
```

```yaml
# php7-sw
version: "2.2"
services:
  php7-sw:
    container_name: "php7-sw"
    hostname: "php7-sw"
    image: "mileschou/swoole:7.1"
    volumes:
      - /home/cg/MyApp:/MyApp
    ports:
      - "5551:8080"
    environment:
      - TZ=Asia/Shanghai
    working_dir: /MyApp
    cpus: 2
    cpuset: "0,1"
    mem_limit: 1024m
    memswap_limit: 1024m
    mem_reservation: 1024m
    tty: true
```

The PHP code:

```php
<?php
use Swoole\Http\Server;
use Swoole\Http\Response;

$http = new Server("0.0.0.0", 8080);
$http->set(['worker_num' => 2]);

$http->on("request", function ($request, Response $response) {
    // go(function () use ($response) {
    //     Swoole\Coroutine::sleep(0.01);
        $response->end('Hello World');
    // });
});

$http->on("start", function (Server $server) {
    go(function () use ($server) {
        echo "Server listen on 0.0.0.0:8080\n";
    });
});

$http->start();
```

**Java key code:** (source from github.com/netty/netty)

```java
public static void main(String[] args) throws Exception {
    // Configure SSL.
    final SslContext sslCtx;
    if (SSL) {
        SelfSignedCertificate ssc = new SelfSignedCertificate();
        sslCtx = SslContextBuilder.forServer(ssc.certificate(), ssc.privateKey()).build();
    } else {
        sslCtx = null;
    }

    // Configure the server.
    EventLoopGroup bossGroup = new NioEventLoopGroup(2);
    EventLoopGroup workerGroup = new NioEventLoopGroup();
    try {
        ServerBootstrap b = new ServerBootstrap();
        b.option(ChannelOption.SO_BACKLOG, 1024);
        b.group(bossGroup, workerGroup)
         .channel(NioServerSocketChannel.class)
         .handler(new LoggingHandler(LogLevel.INFO))
         .childHandler(new HttpHelloWorldServerInitializer(sslCtx));

        Channel ch = b.bind(PORT).sync().channel();

        System.err.println("Open your web browser and navigate to " +
                (SSL ? "https" : "http") + "://127.0.0.1:" + PORT + '/');

        ch.closeFuture().sync();
    } finally {
        bossGroup.shutdownGracefully();
        workerGroup.shutdownGracefully();
    }
}
```

Since I allocated only two CPU cores, each service enables only a small number of worker processes. Port 5551 serves the PHP service, and port 5555 serves the Java service.

9.2 Load-testing tool: ApacheBench (ab)

The ab command: docker run --rm jordi/ab -k -c 1000 -n 1000000 http://10.234.3.32:5555/ (-k enables HTTP keep-alive, -c 1000 sets 1000 concurrent clients, -n 1000000 sends one million requests in total).

The results of benchmarking 1,000,000 HTTP requests at a concurrency of 1000 are as follows.

Java + Netty load-test results:

PHP + Swoole load-test results:

**PS:** the figures above show the best of three load-test runs.

Overall, the performance difference is small, and the PHP + Swoole service even edges out the Java + Netty service slightly, especially in memory usage: 600 MB for Java versus 30 MB for PHP.

What does that mean?

In this test there are no blocking I/O operations, so no coroutine switching takes place. It simply shows that a multi-threaded + epoll model can squeeze the CPU effectively, and that even in PHP you can write high-concurrency, high-performance services.

10. Performance comparison: time to witness the magic

The code above doesn't show coroutines at their best, because the request path contains no blocking operations at all; real applications, however, are full of them: file reads, DB connections/queries, and so on. So let's look at the load-test results after adding a blocking operation.

I added a sleep(0.01) (in seconds) to both the Java and the PHP code to simulate a 0.01-second blocking system call.

The full code will not be posted again.
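For the PHP side, the change amounts to enabling the coroutine block that was commented out in the earlier listing (the Java change is analogous):

```php
$http->on("request", function ($request, Response $response) {
    go(function () use ($response) {
        Swoole\Coroutine::sleep(0.01); // simulate a 10 ms blocking system call
        $response->end('Hello World');
    });
});
```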

Java + Netty with I/O blocking:

Running the full load test took about 10 minutes…

PHP + Swoole with I/O blocking:

**As the results show:** the QPS of the coroutine-based PHP + Swoole service is 6 times that of the Java + Netty service.

Of course, both tests use the official demo source code, so there is certainly more configuration that could be tuned, and the results would then look much better.

**Why doesn't the official demo set the default number of threads/processes a little higher?**

More processes/threads are not necessarily better. As discussed above, process/thread switching imposes extra CPU cost, especially the switching between user mode and kernel mode.
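As a rough rule of thumb (a standard sizing heuristic, not something from the original article), a pool is often sized as N_threads ≈ N_cores × (1 + wait time / compute time). A purely CPU-bound service gains little beyond one thread per core, while an I/O-heavy one can justify more: with 2 cores and requests that spend 10 ms waiting for every 1 ms of computation, N ≈ 2 × (1 + 10) = 22 threads.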

11. Summary

Regarding the results of the two sections above: I am not picking on Java. What I want to say is that once you understand the heart of high concurrency and aim at that target, you can build a high-concurrency, high-performance system in any programming language, as long as you optimize for effective CPU use (connection pooling, resident daemon processes, multithreading, coroutines, select polling, epoll event-driven I/O).

So, do you understand now what high concurrency is?

Thinking is always more important than results! (This article is simultaneously published at: www.52im.net/thread-3120…)