
Twemproxy (Redis/Memcached Proxy)

github.com/meitu/twemp…

1. Uses an Nginx-style master-workers model to support multiple processes

2. Supports online configuration reload

3. Supports Redis master-slave read/write separation

4. Adds access latency (request latency and server response latency) distributions, which are extremely important for troubleshooting and monitoring

On performance, we can immodestly say it is better than other current solutions because it is multi-process: in internal stress tests with connections spread evenly, throughput grows linearly with the number of CPU cores, and the final bottleneck is the network card. On an internal machine with a Gigabit NIC and 8 cores, QPS reaches 50W (500K) per second in the small-packet scenario; with a multi-queue NIC bound to multiple cores, performance would be even better.

 

Introduction: After reading the interpretation of 360's Pika published by High Availability Architecture, the author found that the open source twemproxy component from the Meitu team offers some more advanced features. Taking advantage of the National Day holiday, he summarized and interpreted the relevant technology, giving a comprehensive introduction to the service system and its many powerful functions. Internet practitioners interested in learning more about advanced uses of Redis and Memcached should read this article in detail.

Tian Yilin, technical manager at Meitu, is mainly responsible for the R&D of NoSQL / message queue / middleware and blockchain basic services. He has done in-depth research on distributed caching and high-availability architecture design, and led the team to complete the servitization of Meitu's cache and message queues. Before joining Meitu, he worked on basic service R&D on the Sina Weibo architecture platform.

 

Abstract

In the second half of 2017, Meitu began planning a Redis/Memcached resource PaaS platform. After moving to PaaS, however, it faced the problem of how to scale resources up and down. To solve this problem, the Meitu technical team introduced twemproxy as the resource gateway in November 2017.

However, long-term practice showed that the open source version could not fully meet Meitu's actual needs. Its main problems: the single-thread model cannot use multiple cores, so performance is poor; the configuration cannot be reloaded online; Redis master-slave mode is not supported; and there are no latency metrics. The Meitu technical team therefore made the corresponding modifications: we implemented multi-process support and online configuration updates, and added latency-related monitoring metrics.

This article explains the twemproxy implementation and the corresponding modifications in detail, hoping to provide some experience that other technical teams can learn from.

Why twemproxy

Twemproxy is an open source Redis/Memcached proxy developed by Twitter. Its primary goals are to reduce the number of connections to back-end resources and to provide horizontal scaling for the cache layer. Twemproxy supports multiple hash sharding algorithms and can automatically remove failed nodes. Another mature open source solution is Codis, which offers online auto-scaling and a friendly back-end management console, but as a whole it is closer to Redis Cluster than to a proxy. What Meitu needed was a proxy (gateway) for Redis and Memcached PaaS services, so we finally chose twemproxy.

Twemproxy implementation

The main function of twemproxy is to parse user requests and forward them to the back-end cache resources, then forward the responses back to the client.

At the heart of the code implementation are three types of connection objects:

  1. Proxy connection: listens for users' connection requests; each successfully established connection gets a corresponding client connection;
  2. Client connection: created once a connection is successfully established; users read and write data through it. After a request is parsed, a server is selected based on the key and the hash rule and the request is forwarded to it;
  3. Server connection: forwards user requests to the cache resource, then receives and parses the response data and hands it back to the client connection, which returns the response to the user.

The data flow of the three connections is shown below:

(The client connection in the figure above has no imsgq because, once parsed, a request can go directly into the server connection's imsgq.)

  1. The user establishes a connection through the proxy connection, which creates a client connection;
  2. The client connection reads the user's request data; for each complete request it selects a server based on the key and the configured hash rule, and stores the request in that server connection's imsgq;
  3. The server connection sends the requests in its imsgq to the remote resource; once a request has been fully written (into the TCP buffer), its msg migrates from imsgq to omsgq. When the response comes back, the corresponding msg and client connection are located through the omsgq queue;
  4. Finally, the response content is put into the client connection's omsgq, and the client connection sends the data back to the user (a simplified code sketch of this flow follows the list).
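To make the queue hand-offs above concrete, here is a small self-contained sketch in C. The structure and function names (conn, msg, forward, sent, reply, the toy hash) are purely illustrative assumptions and are not the actual twemproxy data structures; the sketch only mirrors the imsgq/omsgq flow described in steps 2 to 4.

    /* msgflow.c - illustrative sketch of the imsgq/omsgq flow described above.
     * These are NOT the real twemproxy structures; they only mirror the steps. */
    #include <stddef.h>
    #include <stdio.h>

    struct conn;

    struct msg {                 /* one parsed request (or its response)      */
        const char *key;         /* request key, used to pick a server        */
        struct conn *owner;      /* client connection that issued the request */
        struct msg *next;
    };

    struct conn {
        const char *name;
        struct msg *imsgq;       /* messages waiting to be written out        */
        struct msg *omsgq;       /* messages written, waiting for a response  */
    };

    static void enqueue(struct msg **q, struct msg *m) {
        m->next = NULL;
        while (*q) q = &(*q)->next;
        *q = m;
    }

    static struct msg *dequeue(struct msg **q) {
        struct msg *m = *q;
        if (m) *q = m->next;
        return m;
    }

    /* step 2: a client connection parsed a full request; pick a server by key
     * hash and put the request directly into that server's imsgq (which is
     * why client connections need no imsgq of their own). */
    static struct conn *forward(struct conn *client, struct conn *servers,
                                size_t nserver, struct msg *req) {
        size_t h = 0;
        for (const char *p = req->key; *p != '\0'; p++)
            h = h * 31 + (size_t)(unsigned char)*p;
        struct conn *server = &servers[h % nserver];
        req->owner = client;
        enqueue(&server->imsgq, req);
        return server;
    }

    /* step 3: the server connection finished writing the request into the TCP
     * buffer, so the msg migrates from imsgq to omsgq to await the response. */
    static void sent(struct conn *server) {
        struct msg *m = dequeue(&server->imsgq);
        if (m != NULL)
            enqueue(&server->omsgq, m);
    }

    /* step 4: a response arrived; match it with the oldest outstanding request
     * in omsgq and return the client connection that must write it back. */
    static struct conn *reply(struct conn *server) {
        struct msg *req = dequeue(&server->omsgq);
        return req ? req->owner : NULL;
    }

    int main(void) {
        struct conn client = { "client", NULL, NULL };
        struct conn servers[2] = { { "server0", NULL, NULL },
                                   { "server1", NULL, NULL } };
        struct msg req = { "user:42", NULL, NULL };

        struct conn *s = forward(&client, servers, 2, &req);
        sent(s);
        printf("request %s -> %s, response returned via %s\n",
               req.key, s->name, reply(s)->name);
        return 0;
    }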

The user request and resource response data mentioned above are parsed and stored in in-memory buffers (mbufs); the internal flow between client and server connections is just a zero-copy handover of pointers. This near-total absence of memory copies is one reason the single-threaded twemproxy can reach 10W (100K) QPS in the small-packet scenario.

But for us, there are several issues with the current open source version:

  1. The single-threaded model cannot take advantage of multiple cores, and its performance is not good enough; in extreme cases, proxies and resources have to be deployed 1:1;
  2. The configuration cannot be reloaded online. Twitter's internal version probably supports this, since there is a reload case in the unit tests, and a PaaS scenario needs constant configuration updates;
  3. Redis master-slave mode is not supported (Redis does not need master-slave in a pure cache scenario), but some of our scenarios do;
  4. There are too few metrics, and no latency metrics at all.

Multiprocess version

To address the above problems, Meitu made some modifications to the open source version; the core additions are multi-process support and online configuration reload. After the transformation, the overall process model is similar to Nginx; a simplified schematic is shown below:

The master's job is to manage the worker processes; it does not receive or process user requests. If a worker process exits abnormally, the master automatically starts a new process to replace it. In addition, the master accepts several signals from the user (a minimal supervision sketch follows the list below):

  • SIGHUP: reload the new configuration
  • SIGTTIN: raise the log level; the higher the level, the more detailed the logs
  • SIGTTOU: lower the log level; the lower the level, the fewer the logs
  • SIGUSR1: reopen the log file
  • SIGTERM: exit gracefully, waiting for a period of time before exiting
  • SIGINT: force an exit
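Below is a minimal, self-contained sketch of the master/worker supervision just described: the master forks workers, replaces any worker that dies, and reacts to SIGHUP/SIGTERM. The worker body, the restart policy and all names are illustrative assumptions, not the actual nc_process.c code.

    /* master_sketch.c - minimal master/worker supervision loop in the spirit of
     * the model above; illustrative only, not the actual nc_process.c logic. */
    #include <signal.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NWORKERS 4

    static volatile sig_atomic_t got_sighup;   /* reload configuration */
    static volatile sig_atomic_t got_sigterm;  /* graceful exit        */

    static void on_signal(int sig) {
        if (sig == SIGHUP)  got_sighup = 1;
        if (sig == SIGTERM) got_sigterm = 1;
    }

    static void worker_loop(void) {
        for (;;) pause();   /* a real worker runs the proxy event loop here */
    }

    static pid_t spawn_worker(void) {
        pid_t pid = fork();
        if (pid == 0) { worker_loop(); _exit(EXIT_SUCCESS); }
        return pid;
    }

    int main(void) {
        pid_t workers[NWORKERS];
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_signal;          /* no SA_RESTART, so waitpid returns */
        sigaction(SIGHUP, &sa, NULL);
        sigaction(SIGTERM, &sa, NULL);

        for (int i = 0; i < NWORKERS; i++)
            workers[i] = spawn_worker();

        for (;;) {
            pid_t dead = waitpid(-1, NULL, 0);   /* also interrupted by signals */

            if (got_sigterm) {                   /* graceful exit: stop workers */
                for (int i = 0; i < NWORKERS; i++)
                    kill(workers[i], SIGTERM);
                break;
            }
            if (got_sighup) {                    /* reload: start workers with the
                                                    new config, retire old ones
                                                    after worker_shutdown_timeout */
                got_sighup = 0;
            }
            if (dead > 0) {                      /* abnormal exit: replace worker */
                for (int i = 0; i < NWORKERS; i++)
                    if (workers[i] == dead)
                        workers[i] = spawn_worker();
            }
        }
        return 0;
    }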

Several global configurations have also been added:

The configurations other than worker_shutdown_timeout should be self-explanatory. worker_shutdown_timeout specifies how long an old worker keeps running after receiving the exit signal. This option is tied to the multi-process implementation: the number of processes and the configuration are updated online by starting new processes to replace the old ones, so this option specifies how long the old processes are retained.
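As a rough illustration, the global section of the YAML configuration might look like the snippet below. Only worker_shutdown_timeout is taken from the text above; the other key names, values and the unit are assumptions for illustration, so check the meitu/twemproxy README for the actual option names.

    global:
      worker_processes: 4            # number of worker processes (assumed key name)
      worker_shutdown_timeout: 30    # how long an old worker lingers after a reload (unit assumed)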

reuse port

Before reuse port, multi-threaded/multi-process services generally listened for connection requests in one of two ways:

  1. One thread is responsible for accepting all new connections, and the other threads/processes only handle connections after they are established. The problem with this approach is that in short-connection scenarios the accept thread easily becomes a bottleneck (a single core generally tops out at around 4W (40K)+ accepts per second in our tests);
  2. All threads/processes accept on the same listening file descriptor at the same time. The problem with this approach is that under high load the threads/processes are woken up unevenly, and there is also a thundering herd effect (newer kernel versions have resolved the thundering herd problem for accept/epoll).

Reuse port (SO_REUSEPORT) allows multiple sockets to listen on the same port without the uneven connection distribution. Using it is also fairly simple: it just sets the SO_REUSEPORT flag on every socket listening on the port, as in the sketch below.
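A minimal C sketch of a listener created with SO_REUSEPORT (not the twemproxy code itself; error handling is mostly omitted). Every worker calling a function like this gets its own listening socket on the same port, and the kernel spreads new connections across them.

    /* reuseport_listen.c - minimal SO_REUSEPORT listener; illustrative only. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int reuseport_listen(unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        struct sockaddr_in addr;

        if (fd < 0)
            return -1;

        /* the only extra step compared to a normal listener: set the flag
         * before bind(), on every socket that listens on this port */
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 511) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }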

Although reuse port was only merged into Linux 3.9, there are kernels that backport it to older versions (at least the 2.6.32 we are using), and many blog posts are somewhat misleading on this point. In addition, the old listeners cannot simply be closed on reload: doing so would drop the connections sitting in the TCP backlog that have completed the three-way handshake but have not been accepted yet, and the business would receive RST packets on those connections.

The way we solve this is to create and maintain the listening connections in the master process; the worker processes simply inherit the listening connections after fork. During a reload, the master can therefore migrate the listening connections from the old workers to the new workers, ensuring that the data in the TCP backlog is not lost.

See nc_process.c#L172 for details. This approach ensures the backlog loses no data as long as the number of processes stays the same or increases.
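A self-contained sketch of the idea described above: the listening socket is created once in the master, and workers obtain it simply by inheriting the file descriptor across fork(), so the listener (and its pending backlog) never has to be closed and reopened. The port number, worker count and the trivial accept() body are illustrative assumptions.

    /* listener_inherit.c - sketch: master owns the listener, workers inherit it
     * via fork(); illustrative only, not the actual nc_process.c code. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int make_listener(unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        struct sockaddr_in addr;

        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 511);
        return fd;
    }

    int main(void)
    {
        int lfd = make_listener(22222);        /* created once, in the master */

        for (int i = 0; i < 2; i++) {
            if (fork() == 0) {                 /* worker: inherits lfd, no re-bind,
                                                  no gap in serving the port     */
                int cfd = accept(lfd, NULL, NULL);
                if (cfd >= 0)
                    close(cfd);
                _exit(0);
            }
        }

        /* on reload the master would fork new workers against the same lfd,
         * so connections already sitting in the backlog are not dropped */
        while (wait(NULL) > 0) { }
        close(lfd);
        return 0;
    }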

Redis master-slave mode

Native twemproxy does not support Redis master-slave mode, mainly because it treats Redis/Memcached as a cache rather than as storage, so a master-slave structure is unnecessary and operations stay simple. But some of our internal businesses are not pure caches, so we need this master-slave structure. The configuration is also simple:

If a server's name is master, that instance is treated as the master. Only one master is allowed per pool; otherwise the configuration is invalid.
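An illustrative pool configuration in twemproxy's standard YAML format, where each server line is host:port:weight followed by an optional name: the instance named master is treated as the master and the others serve reads. The addresses, the pool name, the slave names and the other options here are placeholders, not a recommended setup.

    beta:
      listen: 127.0.0.1:22122
      hash: fnv1a_64
      distribution: ketama
      redis: true
      servers:
        - 127.0.0.1:6379:1 master      # the only master allowed in this pool
        - 127.0.0.1:6380:1 slave-1     # reads can be routed to the slaves
        - 127.0.0.1:6381:1 slave-2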

Statistical indicators

In my opinion, another problem with twemproxy is the complete lack of latency metrics, which hurts troubleshooting, monitoring, and alerting. To solve this, we added two latency metrics:

  1. Request latency: the latency from receiving the client's request to returning the response, which includes both the time spent inside twemproxy and the server time. This metric is closest to what the business actually experiences;
  2. Server latency: the latency of twemproxy's requests to the servers, which can be read as the latency of the Redis/Memcached server itself.

When an occasional problem occurs, these two latencies make it possible to determine whether the slow request was caused by twemproxy, by the server, or by the client (for example a GC pause). They can also be used to monitor and alert on the ratio of slow requests. Both metrics are recorded in buckets, such as <1ms, <10ms, and so on, as sketched below.
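A small self-contained sketch of the bucketed recording described above: each latency sample increments the counter for its bucket (<1ms, <10ms, ...). The exact bucket boundaries, struct and function names are illustrative assumptions, not the actual metric implementation.

    /* latency_buckets.c - illustrative bucketed latency histogram. */
    #include <stdint.h>
    #include <stdio.h>

    /* bucket upper bounds in microseconds: <1ms, <10ms, <100ms, <1s, plus a
     * final overflow bucket for everything slower */
    static const uint64_t bucket_upper_us[] = { 1000, 10000, 100000, 1000000 };
    #define NBUCKETS (sizeof(bucket_upper_us) / sizeof(bucket_upper_us[0]) + 1)

    struct latency_hist {
        uint64_t counts[NBUCKETS];
    };

    static void hist_record(struct latency_hist *h, uint64_t latency_us)
    {
        size_t i = 0;
        while (i < NBUCKETS - 1 && latency_us >= bucket_upper_us[i])
            i++;
        h->counts[i]++;     /* one counter bump per sample, cheap to keep */
    }

    int main(void)
    {
        struct latency_hist request_latency = { { 0 } };

        /* e.g. request latency is recorded when the response is written back */
        hist_record(&request_latency, 350);     /* 0.35 ms -> "<1ms" bucket  */
        hist_record(&request_latency, 42000);   /* 42 ms  -> "<100ms" bucket */

        for (size_t i = 0; i < NBUCKETS; i++)
            printf("bucket %zu: %llu\n", i,
                   (unsigned long long)request_latency.counts[i]);
        return 0;
    }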

Remaining problems

  1. When the number of workers is reduced, the TCP backlog of the destroyed old workers is lost, leading to some connection timeouts;
  2. Unix domain sockets have no reuse port, so they remain single-process, although online reload still works;
  3. The Memcached binary protocol is not supported; a PR was offered a few years ago but never made it into master;
  4. The maximum number of client connections can be configured but does not actually take effect; we will add this later;
  5. Command support is incomplete (mainly some key-related and blocking commands are missing);
  6. Inconsistent configuration between old and new processes during a reload may result in dirty data.

Performance stress test

The following data were obtained in a long-connection, small-packet scenario, to verify whether the multi-process version behaves as expected: performance should grow linearly with the number of CPU cores until some other hardware becomes the bottleneck.

The stress test environment is as follows:

  • CentOS 6.6
  • CPU: Intel E5-2660, 32 logical cores
  • 64 GB memory
  • Two Gigabit NICs bonded as bond0

A single worker performs about the same as twemproxy before the modification, roughly 10W (100K) QPS. As the number of workers increases, performance grows essentially linearly with the number of workers, which is in line with expectations. The bottleneck above 8 cores is that a single NIC hits its limit because packet reception is uneven in this bond0 mode, so that part of the data is not representative. These numbers come from stress tests in our own environment; you are welcome to verify them. If multiple NICs are used, bind the interrupts or multiple queues to multiple CPUs to avoid a soft-interrupt processing bottleneck on CPU0.

Finally

The implementation of the multi-process version of twemproxy is relatively simple, but in the process a number of twemproxy detail issues were discovered and fixed (some reported by users), such as the mbuf never shrinking once allocated, which causes memory usage to grow and never come back down. Many feature details are still being refined, and we maintain only one version on GitHub.

Code address: github.com/meitu/twemp…

In addition, our team has also open sourced several other projects:

  • Golang version of Kafka Consumer Group
  • Kafka Consumer Group for PHP
  • DPoS implementation based on Ethereum

More open source projects will follow in the future; feel free to follow us.

Finally, thanks to the code contributors: @ruoshan @Karelcrcr @Huang-lin @Git-hulk

References

1. twitter/twemproxy: github.com/twitter/twe…

2. meitu/twemproxy: github.com/meitu/twemp…

3. Twemproxy binary memcached: github.com/twitter/twe…

4. soreuseport: TCP/IPv4 implementation: lwn.net/Articles/54…

5. The SO_REUSEPORT socket option: lwn.net/Articles/54…