This article is shared by Du Minxiang of the Shimo Docs technical team. The original title, "WebSocket Million-Scale Long-Connection Technology Practice at Shimo Docs", has been revised.

1. Introduction

Several Shimo Docs features, such as document sharing, comments, slide presentations, and table following, involve real-time data synchronization across multiple clients and bulk online data push from the server. The ordinary HTTP protocol cannot handle scenarios where the server actively pushes data, so we chose WebSocket for these businesses.

As the Shimo Docs business grew, peak daily connections reached the million level. The rising number of user connections, combined with an architecture no longer suited to that scale, led to sharp increases in memory and CPU usage, so we decided to rebuild the long-connection gateway.

This article shares how the Shimo Docs long-connection gateway evolved from the 1.0 architecture to 2.0 and summarizes the performance optimization practices along the way.

2. Articles in This Series

This is the sixth article in the series, which consists of:

  • "Long Connection Gateway Technology Topic (1): Summary of JD JingMai's Production-Grade TCP Gateway Technology Practice"
  • "Long Connection Gateway Technology Topic (2): Zhihu's High-Performance Long Connection Gateway Practice at Tens of Millions of Concurrent Connections"
  • "Long Connection Gateway Technology Topic (3): The Technical Evolution of a Mobile Access-Layer Gateway"
  • "Long Connection Gateway Technology Topic (4): iQIYI's WebSocket Real-Time Push Gateway Technology Practice"
  • "Long Connection Gateway Technology Topic (5): Ximalaya's Self-Developed 100-Million-Level API Gateway Technology Practice"
  • "Long Connection Gateway Technology Topic (6): Shimo Docs' Single-Machine 500K WebSocket Long Connection Architecture Practice" (* this article)

3. Problems with the V1.0 Architecture

Version 1.0 of the long-connection gateway was developed in Node.js on top of Socket.IO, and it served the business scenarios well at the user scale of the time.

3.1 Architecture Introduction

Architecture design diagram of version 1.0:

Client connection process in 1.0:

  • 1) The user connects to the gateway through Nginx, and the business service is notified of the connection;

  • 2) After sensing the user connection, the business service queries the relevant user data and then publishes (Pub) the message to Redis;

  • 3) The gateway service receives the message through a Redis subscription (Sub);

  • 4) The gateway cluster queries the user session data and pushes the message to the client.

3.2 Problems

Although version 1.0 of the long-connection gateway ran well in production, it could not support the expansion of subsequent businesses well.

The following problems needed to be solved:

  • 1) Resource consumption: Nginx was used only for TLS termination and transparent request forwarding, which wasted a lot of resources; meanwhile, the Node.js gateway performed poorly and consumed large amounts of CPU and memory;

  • 2) Maintenance and observability: the service was not integrated into Shimo's monitoring system and could not be connected to the existing monitoring and alerting, which made maintenance difficult;

  • 3) Business coupling: because business logic and gateway functions lived in the same service, it was impossible to scale horizontally against the performance hotspots of individual businesses; solving the performance problems and enabling subsequent module expansion required decoupling the services.

4. V2.0 Architecture Evolution Practice

4.1 Overview

The V2.0 long-connection gateway needed to address several issues.

For example, Shimo Docs has many internal components (documents, tables, slides, forms, and so on). In version 1.0, these components could call the gateway through Redis, Kafka, and HTTP interfaces, so the sources of calls were untraceable and hard to control.

In addition, from the perspective of performance optimization, the original service needed to be decoupled, so the 1.0 gateway was split into a gateway part and a business-processing part.

Specifically:

  • 1) The gateway part is WS-Gateway: it integrates user authentication, TLS certificate verification, WebSocket connection management, and so on;

  • 2) The business-processing part is WS-API: component services communicate with it directly over gRPC.

In addition:

  • 1) Specific modules can now be scaled independently;

  • 2) Service refactoring combined with the removal of Nginx significantly reduced overall hardware consumption;

  • 3) The service was integrated into the Shimo monitoring system.

4.2 Overall Architecture

Version 2.0 architecture design diagram:

Client connection process in 2.0:

  • 1) The client establishes a WebSocket connection with the WS-Gateway service through the handshake process;

  • 2) After the connection is established, the WS-Gateway service stores the session on the node, caches the connection-information mapping in Redis, and pushes a client-online message to WS-API through Kafka;

  • 3) WS-API receives client-online messages and client upstream messages through Kafka;

  • 4) WS-API preprocesses and assembles the messages, including fetching the data needed for the push from Redis and running the push filtering logic, and then publishes the messages to Kafka;

  • 5) WS-Gateway obtains the messages to be returned by subscribing to Kafka and pushes them to the clients one by one.

4.3 Handshake Process

When the network is in good condition, steps 1 to 6 below are completed and the flow enters the WebSocket stage directly. When the network environment is poor, communication degrades from WebSocket to HTTP: the client pushes messages to the server through POST requests and reads data returned by the server through GET long polling.

The handshake process for a client's initial request to establish a connection with the server:

The process is described as follows:

  • 1) The client sends a GET request to try to establish a connection;

  • 2) The server returns the connection data; sid is the unique Socket ID generated for this connection and is used as the credential in subsequent interactions: {"sid":"xxx","upgrades":["websocket"],"pingInterval":xxx,"pingTimeout":xxx}

  • 3) The client sends another request carrying the sid parameter from step 2;

  • 4) The server returns 40, indicating the request was received successfully;

  • 5) The client sends a POST request to confirm the status of the later degraded channel;

  • 6) The server returns OK, completing the first stage of the handshake;

  • 7) The client then tries to initiate a WebSocket connection, first exchanging the 2probe and 3probe probe messages (see the sketch below); once the channel is confirmed to be clear, normal WebSocket communication can proceed.
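As an illustration of step 7, here is a minimal server-side sketch of the probe exchange, assuming the gorilla/websocket library; the function name and error handling are illustrative, not Shimo's actual code:

package main

import (
    "fmt"

    "github.com/gorilla/websocket"
)

// answerProbe handles the upgrade probe: the client sends "2probe" on the
// newly opened WebSocket channel and expects "3probe" back before it
// confirms the upgrade and starts normal traffic.
func answerProbe(conn *websocket.Conn) error {
    _, msg, err := conn.ReadMessage()
    if err != nil {
        return err
    }
    if string(msg) != "2probe" {
        return fmt.Errorf("unexpected probe message: %q", msg)
    }
    return conn.WriteMessage(websocket.TextMessage, []byte("3probe"))
}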

4.4 TLS Memory Consumption Optimization

The WSS protocol is used for the connection between client and server. In version 1.0, the TLS certificate was mounted on Nginx, which handled the HTTPS handshake. To reduce the machine cost of Nginx, in version 2.0 we mounted the certificate in the service itself.

Analysis of the service memory, shown in the figure below, revealed that memory consumed during the TLS handshake accounted for about 30% of total memory consumption.

This memory consumption is unavoidable, and we had two options:

  • 1) Mount the TLS certificate on a layer-7 load balancer and hand the TLS handshake to a better-performing tool;

  • 2) Optimize the performance of Go's TLS handshake. From discussions with Cao Chunhui (Cao Da), a well-known figure in the field, we learned that he had recently submitted a PR to the official Go library, along with related performance test data.

4.5 Socket ID Design

A unique code must be generated for every connection; any duplicate would cause crossed sessions and mixed-up message pushes. We chose the SnowFlake algorithm as the unique-code generator.

In a physical-machine scenario, a fixed number is assigned to the physical machine hosting each replica, which guarantees that the Socket IDs generated by the service on each replica are unique.

In the K8s scenario this approach is not feasible, so replica numbers are assigned through registration instead. When a WS-Gateway replica starts, it writes its startup information to the database to obtain a replica number, which is then used as the worker-ID parameter of the SnowFlake algorithm when generating Socket IDs. After a restart, an existing replica inherits its previous number; when a new version is deployed, new replica numbers are issued from an auto-incrementing ID.

At the same time, each WS-Gateway replica writes heartbeat information to the database, serving as a health check for the gateway service itself.
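A minimal sketch of SnowFlake-style Socket ID generation with the replica number as the worker ID; the bit layout (41-bit timestamp, 10-bit replica ID, 12-bit sequence) and epoch here are illustrative assumptions, not Shimo's exact parameters:

package main

import (
    "sync"
    "time"
)

const epochMs int64 = 1609459200000 // illustrative epoch (2021-01-01)

// Snowflake produces IDs laid out as: 41-bit timestamp | 10-bit replica | 12-bit sequence.
type Snowflake struct {
    mu      sync.Mutex
    lastMs  int64
    seq     int64
    replica int64 // replica number obtained from database registration at startup
}

func (s *Snowflake) NextID() int64 {
    s.mu.Lock()
    defer s.mu.Unlock()
    now := time.Now().UnixMilli()
    if now == s.lastMs {
        s.seq = (s.seq + 1) & 0xFFF // 12-bit sequence within one millisecond
        if s.seq == 0 {
            for now <= s.lastMs { // sequence exhausted: wait for the next millisecond
                now = time.Now().UnixMilli()
            }
        }
    } else {
        s.seq = 0
    }
    s.lastMs = now
    return (now-epochMs)<<22 | (s.replica&0x3FF)<<12 | s.seq
}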

4.6 Cluster Session Management Solution: Event broadcast

After the client completes the handshake, its session data is stored in the memory of the current gateway node, and the serializable part is stored in Redis. The storage structure is shown in the figure below.

For a push triggered by a client or a component service, the WS-API service uses the data structures stored in Redis to look up the Socket ID of the target client for the message body, and the WS-Gateway services in the cluster then consume the message. If the Socket ID is not on the current node, the node-to-session relationship must be queried to find the WS-Gateway node holding that client's Socket ID. In general there are two solutions, shown in the figure below.

After deciding to use event broadcast for messaging between gateway nodes, we further evaluated which messaging middleware to use. Three candidate schemes were compared (see the figure below).

We therefore ran 1,000,000 (100w) join and leave operations against Redis and the other MQ middleware. The tests showed that Redis performs very well when the payload is under 10K.

Combining this with our actual situation: the broadcast payload is around 1K, the business scenario is simple and fixed, and compatibility with historical business logic was needed, so we finally chose Redis for message broadcasting.
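A minimal sketch of the Redis-based event broadcast between gateway nodes, assuming github.com/go-redis/redis/v8; the channel name ws:broadcast and the handler signature are illustrative:

package main

import (
    "context"

    "github.com/go-redis/redis/v8"
)

// broadcast publishes a message to every WS-Gateway node in the cluster.
func broadcast(ctx context.Context, rdb *redis.Client, payload []byte) error {
    return rdb.Publish(ctx, "ws:broadcast", payload).Err()
}

// listen subscribes to the broadcast channel; each node delivers a message
// only if the target Socket ID lives in its own session memory.
func listen(ctx context.Context, rdb *redis.Client, handle func([]byte)) {
    sub := rdb.Subscribe(ctx, "ws:broadcast")
    defer sub.Close()
    for msg := range sub.Channel() {
        handle([]byte(msg.Payload))
    }
}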

Later, WS-API and WS-Gateway can also be connected pairwise, using gRPC streams to save intranet traffic.

4.7 Heartbeat Mechanism

With sessions stored in node memory and in Redis, the client must keep refreshing the session timestamp through heartbeat reporting. The client reports heartbeats at the interval issued by the server; the timestamp is updated in memory first and then synchronized to Redis on a separate cycle, which avoids the pressure on Redis that would result from large numbers of clients reporting heartbeats simultaneously.

Specific process:

  • 1) After the client successfully establishes the WebSocket connection, the server delivers the heartbeat-reporting parameters;

  • 2) The client sends heartbeat packets according to those parameters, and the server updates the session timestamp when a heartbeat is received;

  • 3) Other upstream data from the client also triggers an update of the corresponding session timestamp;

  • 4) The server periodically cleans up timed-out sessions and performs the active close procedure;

  • 5) The mappings between WebSocket connections, users, and files are cleared based on the timestamp data updated in Redis.

The cleanup logic for session data in memory and in the Redis cache:

for {
    select {
    case <-t.C:
        var now = time.Now().Unix()
        var clients = make([]*Connection, 0)
        dispatcher.clients.Range(func(_, v interface{}) bool {
            client := v.(*Connection)
            lastTs := atomic.LoadInt64(&client.LastMessageTS)
            if now-lastTs > int64(expireTime) {
                // Session timed out: collect it for closing.
                clients = append(clients, client)
            } else {
                // Session still alive: synchronize the latest timestamp to the Redis mapping.
                dispatcher.clearRedisMapping(client.Id, client.Uid, lastTs, clearTimeout)
            }
            return true
        })
        for _, cli := range clients {
            cli.WsClose()
        }
    }
}

Building on the existing two-level cache refresh mechanism, a dynamic heartbeat-reporting frequency further reduces the performance pressure that heartbeats place on the server. By default, the client reports a heartbeat to the server every 1s; assuming a single machine carries 500,000 (50w) connections, the resulting QPS is QPS1 = 500000 / 1 = 500,000.

From the perspective of server performance, the interval is made dynamic while heartbeats are normal: after every x normal heartbeat reports, the interval grows by a, up to an upper limit of y. The minimum dynamic QPS is then QPS2 = 500000 / y.

In the extreme case, the QPS produced by heartbeats is reduced by a factor of y. When a heartbeat times out, the server immediately resets the interval to the 1s base and retries. This policy preserves connection quality while reducing the performance cost that heartbeats impose on the server.
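A minimal sketch of this interval policy, with x, a (the step), and y (the upper limit) as the parameters described above; the function name and the 1s base are illustrative, not Shimo's production values:

package main

import "time"

// nextHeartbeatInterval grows the reporting interval by step (a) after every
// x consecutive normal heartbeats, capped at maxIv (y); on timeout the
// interval falls straight back to the 1s base and the client retries.
func nextHeartbeatInterval(cur, step, maxIv time.Duration, okCount, x int, timedOut bool) time.Duration {
    base := time.Second
    if timedOut {
        return base // heartbeat timed out: return to the 1s base immediately
    }
    if x > 0 && okCount > 0 && okCount%x == 0 && cur+step <= maxIv {
        return cur + step // every x normal heartbeats, widen the interval by a
    }
    return cur
}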

4.8 Customizing Headers

The purpose of using custom Kafka headers is to avoid the performance cost of decoding the message body at the gateway layer.

After the client's WebSocket connection is established, a series of business operations follow. We chose to put the operation instructions exchanged between WS-Gateway and WS-API, together with the necessary parameters, into Kafka headers. For example, x-xx-operator = broadcast means: read the file number from x-xx-guid and push the message to all users in that file.

A trace ID and a timestamp are also written into the Kafka headers, so that the full consumption path of a message and the time spent in each phase can be traced.
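A minimal sketch of writing these instructions into Kafka headers so the gateway can route without decoding the body, assuming github.com/segmentio/kafka-go; the header keys mirror the x-xx-operator / x-xx-guid example above, and the other names are illustrative:

package main

import (
    "context"
    "strconv"
    "time"

    "github.com/segmentio/kafka-go"
)

// publish sends a message whose body stays opaque to WS-Gateway: the routing
// instruction, target file number, trace ID, and timestamp all travel in headers.
func publish(ctx context.Context, w *kafka.Writer, fileGUID, traceID string, body []byte) error {
    return w.WriteMessages(ctx, kafka.Message{
        Headers: []kafka.Header{
            {Key: "x-xx-operator", Value: []byte("broadcast")},
            {Key: "x-xx-guid", Value: []byte(fileGUID)},
            {Key: "x-trace-id", Value: []byte(traceID)},
            {Key: "x-timestamp", Value: []byte(strconv.FormatInt(time.Now().UnixMilli(), 10))},
        },
        Value: body, // never decoded at the gateway layer
    })
}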

4.9 Receiving and Sending Messages

type Packet struct {
    // ...
}

type Connect struct {
    *websocket.Conn
    send chan Packet // buffered channel drained by the writer goroutine
}

func NewConnect(conn net.Conn) *Connect {
    c := &Connect{
        send: make(chan Packet, N),
    }
    go c.reader() // goroutine 1: reads from the connection
    go c.writer() // goroutine 2: drains the send channel and writes
    return c
}

The first version of the client-server message interaction was written along these lines.

Closer inspection revealed that each WebSocket connection occupies 3 goroutines. Each goroutine needs its own stack memory, so the number of connections a single machine can carry is quite limited.

Given the large memory footprint and the fact that c.writer() sits idle most of the time, we considered whether the interaction could be completed with only two goroutines.

type Packet struct {
    // ...
}

type Connect struct {
    *websocket.Conn
    mux sync.RWMutex // guards writes; replaces the writer goroutine
}

func NewConnect(conn net.Conn) *Connect {
    c := &Connect{}
    go c.reader() // only the reader goroutine remains
    return c
}

func (c *Connect) Write(data []byte) (err error) {
    c.mux.Lock()
    defer c.mux.Unlock()
    // ...
    return nil
}

The goroutine for c.reader() is kept, because polling data out of a buffer instead could introduce read latency or locking problems. The c.writer() operation is changed to an active call rather than a goroutine that listens continuously, which reduces memory consumption.

We also investigated lightweight, high-performance event-driven network libraries such as gev and gnet, but found message-delay problems in scenarios with large numbers of connections, so they were not used in production.

4.10 Core object Caching

With the data receiving and sending logic settled, the core object of the gateway is the Connection object, around which functions such as run, read, write, and close are developed.

sync.Pool is used to cache this object and relieve GC pressure: when a connection is created, a Connection object is obtained from the object pool.

When its life cycle ends, the Connection object is reset and Put back into the pool.

In actual coding, it is recommended to encapsulate GetConn() and PutConn() functions to centralize operations such as data initialization and object reset.

var ConnectionPool = sync.Pool{
    New: func() interface{} {
        return &Connection{}
    },
}

func GetConn() *Connection {
    cli := ConnectionPool.Get().(*Connection)
    return cli
}

func PutConn(cli *Connection) {
    cli.Reset()             // reset the object's state before reuse
    ConnectionPool.Put(cli) // put the connection back into the pool
}
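A short usage sketch of the pooled lifecycle; the accept path, the Conn field, and run() are hypothetical names used only for illustration:

func handleAccept(wsConn *websocket.Conn) {
    cli := GetConn()   // take a Connection from the pool instead of allocating
    defer PutConn(cli) // reset and return it once the connection ends
    cli.Conn = wsConn  // hypothetical field binding after the handshake
    cli.run()          // serve until the connection closes (hypothetical)
}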

4.11 Optimizing Data Transmission

For message flow, the transmission efficiency of the message body must also be considered. We adopted MessagePack to serialize the message body and compress its size, and adjusted the MTU to avoid IP fragmentation. With a defined as the probe packet size, the following command detects the MTU limit toward the target service IP:

ping -s {a} {ip}

When a is 1400, the actual transmitted packet size is 1428.

The extra 28 bytes consist of 8 bytes (the ICMP echo request/reply header) plus 20 bytes (the IP header).

If a is set too large, the responses time out; when the real packet size exceeds the MTU, IP fragmentation occurs.

While tuning to a suitable MTU value, the message body is serialized with MessagePack, further compressing the packet size and reducing CPU consumption.
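A minimal sketch of the MessagePack serialization, assuming github.com/vmihailenco/msgpack/v5; the Payload type is illustrative:

package main

import "github.com/vmihailenco/msgpack/v5"

// Payload is an illustrative message body; MessagePack encodes it more
// compactly than JSON, shrinking the packet and the CPU spent on it.
type Payload struct {
    Type string `msgpack:"type"`
    Data []byte `msgpack:"data"`
}

func encode(p *Payload) ([]byte, error) {
    return msgpack.Marshal(p)
}

func decode(b []byte) (*Payload, error) {
    var p Payload
    if err := msgpack.Unmarshal(b, &p); err != nil {
        return nil, err
    }
    return &p, nil
}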

4.12 Infrastructure support

We used the EGO framework for service development, which covers service log printing, asynchronous log output, and dynamic log-level adjustment, making online troubleshooting easier and log printing more efficient. It also provides microservice monitoring: CPU, P99 latency, memory, goroutine counts, and so on.

Client Redis monitoring:

Client Kafka monitoring:

Custom monitoring:

5. The Moment of Truth: Performance Load Testing

5.1 Load Test Preparation

The test platform is as follows:

  • 1) One 16-core, 32 GB virtual machine serves as the server machine, targeted to carry 480,000 (48w) connections;

  • 2) Eight 4-core, 8 GB virtual machines serve as clients, each opening 60,000 (6w) ports.

5.2 Simulation Scenario 1

Bringing users online: 500,000 (50w) online users.

The peak rate of connections established per second on a single WS-Gateway was 16,000/s, and each user occupied about 47K of memory.

5.3 Simulation Scenario 2

Test duration 15 minutes, 500,000 (50w) users online, a push to all users every 5s, with a receipt from each user.

The push content is:

42["message",{"type":"xx","data":{"type":"xx","clients":[{"id":xx,"name":"xx","email":"[email protected]","avatar":"ZgG5kEjCkT6mZla6.png","created_at":1623811084000,"name_pinyin":"","team_id":13,"team_role":"member","merged_into":0,"team_time":1623811084000,"mobile":"+xxxx","mobile_account":"","status":1,"has_password":true,"team":null,"membership":null,"is_seat":true,"team_role_enum":3,"register_time":1623811084000,"alias":"","type":"anoymous"}],"userCount":1,"from":"ws"}}]

After 5 minutes, the service restarted abnormally because memory usage exceeded the limit.

Analysis of the cause of the excess memory usage:

The new broadcast code was using 9.32% of memory:

The part receiving users' receipt messages consumed 10.38% of memory:

The test rules were adjusted: test duration 15 minutes, 480,000 (48w) users online.

The push content is:

42["message",{"type":"xx","data":{"type":"xx","clients":[{"id":xx,"name":"xx","email":"[email protected]","avatar":"ZgG5kEjCkT6mZla6.png","created_at":1623811084000,"name_pinyin":"","team_id":13,"team_role":"member","merged_into":0,"team_time":1623811084000,"mobile":"+xxxx","mobile_account":"","status":1,"has_password":true,"team":null,"membership":null,"is_seat":true,"team_role_enum":3,"register_time":1623811084000,"alias":"","type":"anoymous"}],"userCount":1,"from":"ws"}}]

Connection establishment peak: 10,000/s; data reception peak: 96,000/s; data sending peak: 96,000/s.

5.4 Simulation Scenario 3

Test duration 15 minutes, 500,000 (50w) users online, a push to all users every 5s, no receipt required.

The push content is:

42["message",{"type":"xx","data":{"type":"xx","clients":[{"id":xx,"name":"xx","email":"[email protected]","avatar":"ZgG5kEjCkT6mZla6.png","created_at":1623811084000,"name_pinyin":"","team_id":13,"team_role":"member","merged_into":0,"team_time":1623811084000,"mobile":"+xxxx","mobile_account":"","status":1,"has_password":true,"team":null,"membership":null,"is_seat":true,"team_role_enum":3,"register_time":1623811084000,"alias":"","type":"anoymous"}],"userCount":1,"from":"ws"}}]

Connection establishment peak: 11,000/s; data sending peak: 100,000/s. No abnormalities occurred other than high memory usage.

Memory consumption was very high; analysis of the flame graph showed that most of it came from the 5-second periodic broadcast operation.

5.5 Simulation Scenario 4

Test duration 15 minutes, 500,000 (50w) users online, a push to all users every 5s with receipts, and 40,000 (4w) users going online and offline every second.

The push content is:

42["message",{"type":"xx","data":{"type":"xx","clients":[{"id":xx,"name":"xx","email":"[email protected]","avatar":"ZgG5kEjCkT6mZla6.png","created_at":1623811084000,"name_pinyin":"","team_id":13,"team_role":"member","merged_into":0,"team_time":1623811084000,"mobile":"+xxxx","mobile_account":"","status":1,"has_password":true,"team":null,"membership":null,"is_seat":true,"team_role_enum":3,"register_time":1623811084000,"alias":"","type":"anoymous"}],"userCount":1,"from":"ws"}}]

Connection establishment peak: 18,570/s; data reception peak: 329,949/s; data sending peak: 393,542/s. No abnormalities occurred.

5.6 Load Test Summary

On hardware with 16 cores and 32G of memory, a single machine held 500,000 (50w) connections across the four scenarios above, covering users going online and offline and message receipts. Memory and CPU consumption matched expectations, and the service remained stable throughout long test runs.

These results meet the resource-saving requirements at the current scale, and we believe we can continue building features on this foundation.

6. Summary

Facing a growing number of users, rebuilding the gateway service was imperative.

This rebuild mainly covered:

  • 1) Decoupling the gateway service from the business services and removing the dependence on Nginx, making the overall architecture clearer;

  • 2) Analyzing the overall flow from user connection to the underlying business message push, and optimizing each step of that flow in detail.

The 2.0 long-connection gateway consumes fewer resources, has a lower per-user memory cost, and has a more complete monitoring and alerting system, making the gateway service itself more reliable.

The optimizations mainly covered the following aspects:

  • 1) A scalable handshake process;

  • 2) Socket ID generation;

  • 3) Optimization of the client heartbeat handling process;

  • 4) Custom headers that avoid message-body decoding and strengthen link tracing and monitoring;

  • 5) An optimized code structure for receiving and sending messages;

  • 6) An object resource pool, using caching to reduce GC frequency;

  • 7) Serialization and compression of the message body;

  • 8) Integration with the service-observability infrastructure to ensure service stability.

Beyond guaranteeing gateway performance, a further step was to converge how the underlying component services call the gateway: the former HTTP, Redis, and Kafka entry points were unified into gRPC calls, ensuring that call sources are traceable and controllable, and laying a better foundation for subsequent service integration.


(This article has been simultaneously published at: www.52im.net/thread-3757…)