This article is from zhihu technology column of official technical team of Zhihu. Thanks to faceair for sharing it selflessly.

1, the introduction

Real-time response is always exciting. For example, you can see that the other party is typing in wechat, you can see that the other party is responding in King Valley, and you can see the same 666 in live live barrage. They are all supported by long connection technology. Almost every Internet company has a set of long-link systems, which are used for notifications, instant messaging, push, live barrage, games, shared location, stock tickers, and more. As companies grow larger and business scenarios become more complex, it is more likely that multiple businesses will need to use long-connected systems at the same time. Designing long connections separately between services can lead to steep development and maintenance costs, waste of infrastructure, increased client power consumption, and failure to reuse existing experience. Sharing a long-link system also requires coordination of authentication, authentication, data isolation, protocol expansion, message delivery guarantee and other requirements between different systems. In the iterative process, protocols need to be forward compatible. Meanwhile, capacity management becomes more difficult because long-links of different services converge into one system. After more than a year of development and evolution, through our services for a number of inside and outside the App, access to a dozen requirements and forms the long connection of business message sending a massive outbreak, hundreds of millions of equipment online at the same time, and so on scene, we made a long connection gateway system of general solution, solves the business more common problems encountered in the process of long connection. Zhihu long connection gateway is committed to service data decoupling, efficient message distribution, capacity problem solving, and to provide a certain degree of message reliability guarantee.

(This article was published simultaneously at:www.52im.net/thread-2737…)



2. Related articles

  • “Netty Dry Goods Sharing: Jingdong Jingmai production grade TCP Gateway Technology Practice Summary”
  • Absolute Dry Goods: Technical Essentials of Netty-based Mass Access Push Service
  • In – Depth understanding of TCP protocol (part 2) : RTT, sliding Windows, congestion Handling
  • “High Performance Network Programming part II: The Famous C10K Concurrent Connection Problem in the Last Decade”
  • High Performance Network Programming (PART 3) : The Next 10 Years, It’s Time to Consider C10M Concurrency
  • “High Performance Network Programming (IV) : Theoretical Exploration of High Performance Network Applications from C10K to C10M”
  • Zhihu Technology Sharing: Redis High Performance Cache Practice Path from Single machine to 20 million QPS Concurrency

3. How do we design communication protocols?

3.1 Service decoupling

A long-link gateway that supports multiple services actually connects to multiple clients and multiple service backends at the same time. It is a many-to-many relationship, and only one long-link is used for communication between them.







Such many-to-many systems are designed to avoid strong coupling. The business logic is also dynamically tuned, and if the protocol and logic of the business are coupled with the gateway implementation, all the business will be involved with each other and the protocol upgrade and maintenance will be extremely difficult.



Therefore, we try to use the classic publish-subscribe model to decouple the long-connection gateway from the client and the business back end, and they only need the convention Topic to publish subscription messages to each other freely. The messages transmitted are pure binary data, and the gateway does not care about the protocol specification and serialization of the business side.







3.2 Permission Control

We decoupled the gateway from the business side implementation using publish-subscribe, and we still needed to control the client’s publish-subscribe permissions to the Topic to avoid intentional or unintentional data contamination or unauthorized access.



If the lecturer is giving a lecture on Channel 165218 of Zhihu Live, when the client enters the room and tries to subscribe to the Topic of Channel 165218, the back end of Zhihu Live needs to judge whether the current user has paid. The permissions in this case are actually quite flexible, with users being able to subscribe when they pay, but not otherwise. The status of permissions is known only by zhihu Live business back-end, and the gateway cannot independently make judgment.



Therefore, we designed callback based authentication mechanism in ACL rules, which can configure the subscription and publishing actions of live-related topics to be judged by HTTP callback to the back-end service of Live.







At the same time, according to our observation of internal business, in most scenarios, what businesses need is only a private Topic of the current user to receive notifications or messages sent by the server. In this case, it would be tedious to design callback interfaces for businesses to determine permissions.



Therefore, we designed Topic template variables in ACL rules to reduce the access cost of the business side. We configured the Topic that the business side allowed to subscribe to contain the identification of connected user name variables, indicating that only users were allowed to subscribe or send messages to their own Topic.







In this case, the gateway can independently and quickly determine whether the client has permission to subscribe or send messages to a Topic without communicating with the business side.



3.3 Message Reliability Assurance


As the hub of message transmission, the gateway interconnects with the service back-end and client. Therefore, the reliability of message transmission must be ensured during message forwarding.



TCP can only ensure the sequence and reliability of transmission. However, when the TCP status is abnormal, the client receiving logic is abnormal, or a Crash occurs, messages in transmission will be lost.



To ensure that the sent or upstream messages are normally processed by the peer end, we implement the return receipt and retransmission functions. After receiving and processing important service messages, the client needs to send an acknowledgement receipt. However, the gateway stores the messages that are not received by the client temporarily. The gateway checks whether the client receives the message and tries to send the message again until the client receives the acknowledgement receipt correctly.







However, in the case of heavy traffic on the server, it is inefficient to send a receipt for each message sent by the server to the gateway. We also provide a message queue-based receiving and sending mode, which will be elaborated later in the implementation of publish and subscribe.



When designing the communication protocol, we refer to the MQTT specification (seeLiteracy stickers: Understand THE MQTT communication protocol”, the authentication and authentication design is expanded, the isolation and decoupling of business messages are completed, and a certain degree of transmission reliability is guaranteed. At the same time, it is compatible with MQTT protocol to a certain extent, so that we can directly use the client-side implementation of MQTT and reduce the cost of business access.

4. How do we design the system architecture?

When designing the overall architecture of the project, our priorities are:

  • 1) Reliability;
  • 2) Horizontal expansion ability;
  • 3) Dependent on component maturity;
  • 4) Be trustworthy if it’s simple.


In order to ensure reliability, we did not consider the traditional long-connected system to centralized internal data storage, computing, message routing and other components in a large distributed system maintenance, which increases the complexity of system implementation and maintenance. We tried to separate the components of these parts, leaving storage and message routing to a professional system to make the functions of each component as simple and clear as possible.



We also need to scale horizontally quickly. In the Internet scenario, all kinds of marketing activities may lead to a sharp increase in the number of connections. At the same time, the number of messages delivered in the publish and subscription model system will increase linearly with the number of Topic subscribers. At this time, the storage pressure of unreceived messages on the temporary gateway client also multiplies. By breaking down the components and reducing in-process state, we can deploy services into containers that allow for rapid and almost unlimited horizontal scaling.



The system architecture of the final design is shown as follows:





The system consists of four main components:

  • 1) The access layer is implemented by OpenResty, which is responsible for connection load balancing and session persistence;
  • 2) Long-connected brokers, deployed in containers, responsible for protocol parsing, authentication and authentication, sessions, publish and subscribe logic;
  • 3) Redis stores and persists session data;
  • 4) Kafka message queue, which distributes messages to brokers or business parties.

Kafka and Redis are the basic components widely used in the industry. They have been platformized and containerized in Zhihu (see Redis at Zhihu, Design and Implementation of Kafka Platform based on Kubernetes of Zhihu), and they can also complete the rapid capacity expansion at the minute level.

5. How do we build long-connect gateways?

5.1 access layer

OpenResty is the most widely used Nginx extension solution for Lua support in the industry. It is flexible, stable, and has excellent performance, and we considered using OpenResty for the access layer solution selection. The access layer is the side closest to the user, and at this layer two things need to be done:

  • 1) Load balancing to ensure that the number of connections on all long-connected Broker instances is relatively balanced;
  • 2) Session persistence, in which a single client connects to the same Broker each time, is used to ensure the reliability of message transmission.

In fact, there are many algorithms for load balancing, whether random or various Hash algorithms can be better implemented, the more troublesome is session persistence. A common four-tier load balancing strategy is to Hash consistently based on the IP address of the connection source. This ensures that the number of nodes remains the same and that the previously connected nodes are likely to be found even if the number of nodes changes slightly. We have used the source IP Hash policy before, and it has two main disadvantages:

  • 1) The distribution is not uniform. Some source IP addresses are NAT egress of large LAN. The number of connections on these IP addresses is large, resulting in unbalanced connections on the Broker.
  • 2) If the client is not identified correctly, the network may not be able to connect back to the Broker when the mobile client drops the line.

Therefore, we consider the seven-layer load balancing, and conduct consistent Hash based on the unique identifier of the client, so that randomness is better, and the correct routing can also be ensured after network switching. The conventional method is to completely parse the communication protocol and then forward it according to the packet of the protocol, which costs a lot and increases the risk of protocol parsing errors. In the end, we chose to leverage the PREREAD mechanism of Nginx for seven-tier load balancing, which is less intrusive to the subsequent implementation of long-connected brokers and has a low resource overhead for the access layer. Nginx accepts a connection by specifying that it pre-reads the connection data into the PRERead buffer, and we extract the client id by parsing the first message sent by the client in the PRERead buffer. This client identity is then used for a consistent Hash to get the fixed Broker.

5.2 Publish and Subscribe

We introduced the widely used message queue Kafka as a hub for internal message transmission. There are a few reasons for this:

  • 1) Reduce the internal state of the long-connected Broker so that it can expand without pressure;
  • 2) Zhihu has been platformized internally and supports horizontal expansion.

Other reasons are:

  • 1) Use message queue to cut peak to avoid sudden up or down messages crushing the system;
  • 2) Kafka is widely used in business systems to transmit data and reduce the cost of docking with business parties.

Using message queue peaking is easy to understand. Let’s take a look at how Kafka can be used to better connect with the business side.

5.3 release

The connected Broker publishes messages to Kafka topics based on the routing configuration, and consumes Kafka to send messages to subscribing clients based on the subscription configuration.



When routing rules and subscription rules are configured separately, there are four possible situations.



Scenario 1: Messages are routed to Kafka Topic but not consumed. This scenario is suitable for data reporting, as shown in the following figure.







Scenario 2: Messages are routed to Kafka Topic, which is also consumed, in a common IM scenario, as shown in the figure below.







Scenario 3: Consume and deliver messages directly from Kafka Topic, as shown in the following figure.







Case 4: Messages are routed to one Topic and then consumed from another Topic for scenarios where messages need to be filtered or preprocessed, as shown in the figure below.







The design flexibility of this routing strategy is very high, and it can solve the requirements of message routing in almost all scenarios. Also, because publish and subscribe is based on Kafka, the message reliability can be guaranteed when processing large-scale data.



5.4 the subscription

When the long-connected Broker consumes a message from a Kafka Topic, it looks up the local subscription and distributes the message to the client session. We initially used HashMap directly to store client subscriptions. When a client subscribes to a Topic, we put the session object of the client into the subscription Map with the Topic as the Key. When we check the subscription relationship of the message, we directly use the Topic to value from the Map. Because the subscription is a shared object, connection attempts are made to the shared object when subscription and unsubscription occur. We added a lock to the HashMap to avoid concurrent writes, but the global lock conflicts were serious and severely impacted performance. Finally, we refine the granularity of locks through sharding and disperse the conflicts of locks. Hundreds of hashmaps are created locally at the same time. When data needs to be accessed on a Key, one of the hashmaps is found through Hash and modulus and then operated. In this way, the global lock is distributed among hundreds of Hashmaps, which greatly reduces operation conflicts and improves overall performance.

5.5 Session Persistence

After a message is sent to the Session object, the Session controls the delivery of the message. The Session determines whether the message is an important Topic message. If yes, it marks the message with a QoS level of 1, stores the message to the unreceived message queue of Redis, and sends the message to the client. Wait until the client ACK the message before deleting the message from the unacknowledged queue. Some industry solutions maintain a list in memory that cannot be migrated during capacity expansion or downsizing. There are also industry solutions that maintain a distributed memory store in a long-connected cluster, which can be more complex to implement. We put unacknowledged message queues in external persistent storage to ensure that if a single Broker goes down, clients can recover Session data when they come back online and connect to other brokers, reducing the burden of capacity expansion and scaling.

5.6 Sliding Windows

When sending a message, each QoS 1 message needs to be transmitted, processed by the client, and sent back to confirm the delivery. The path takes a long time. If the number of messages is large, each message waits for such a long confirmation before being sent to the next one. As a result, the bandwidth of the delivery channel cannot be fully utilized.



In order to ensure the efficiency of sending, we designed the parallel sending mechanism by referring to the sliding window of TCP (see”Easy to Understand – In-depth understanding of TCP (part 2) : RTT, sliding window, and congestion handling”). We set a certain threshold as a sliding window for sending, indicating that there can be so many messages being transmitted and waiting for confirmation on the channel at the same time.







The sliding Windows we designed for the application layer are actually a little different from the SLIDING Windows of TCP.



The IP packets in the TCP sliding window cannot be guaranteed to arrive in sequence, and our communication is based on TCP, so the business messages in our sliding window are in sequence. Only when the connection status is abnormal or the client logic is abnormal, the messages in some Windows may be out of order.



The TCP protocol ensures the sequence of receiving messages. Therefore, there is no need to retry a single message during normal sending. The unacknowledged messages in the window are resended only after the client reconnects. At the same time, the receiving end reserves a buffer of window size for message deduplication to ensure that the service party does not receive duplicate messages.



The sliding window we built based on TCP ensures the order of messages and greatly improves the throughput of transmission.

Write at the end

Zhihu long Connection gateway is developed and maintained by Infra, and the main contributors are @Faceair and @Anjiangze. Infrastructure group is responsible for the flow of zhihu entrance and internal infrastructure, foreign we struggle in the face to face with the mass flow rate of the first front, internally we provide a rock-solid infrastructure for all of the business, users of each visit, each request, the Intranet every call is closely related to our system.

Appendix: More information on network programming

[1] Network programming basics: TCP/IP Detail – Chapter 11 ·UDP: User Datagram Protocol Transmission control Protocol “TCP/IP detail – chapter 18 ·TCP connection establishment and termination” “TCP/IP detail – Chapter 21 ·TCP timeout and retransmission” “Technology past: TCP/IP protocol changed the world (precious many pictures, mobile phone careful points)” “Easy to understand – in-depth understanding of TCP protocol (1) : Theoretical Basis: Simple to Understand – In-depth understanding of TCP (PART 2) : RTT, Sliding Window, Congestion Processing theory Classics: DETAILED explanation of TCP three-handshake and four-wave process Theory and Practice: Wireshark Packet Capture analyzing TCP Three-way handshake and Four-way wave Processing Computer Network Communication Protocol Diagram What is the Maximum Size of a UDP Packet? P2P technology details (a) : NAT – detailed principle, P2P introduction “P2P technology details (b) : P2P through (hole) scheme details” “P2P technology details (c) : P2P technology STUN, TURN, ICE details” “easy to understand: “High Performance Network Programming (1) : How many concurrent TCP connections can a single server have” “High performance network programming (2) : The last 10 years, the famous C10K concurrent connection problem” “High performance network programming (3) : In the next 10 years, it is time to consider the CONCURRENCY problem of C10M. High Performance Network Programming (4) : Theoretical Exploration of High performance Network Applications from C10K to C10M. “Unknown network programming (1) : Analysis of the TCP protocol in the difficult diseases (1)” “Unknown network programming (2) : Analysis of the TCP protocol in the difficult diseases (2)” “Unknown network programming (3) : TIME_WAIT and CLOSE_WAIT when closing TCP connections Unknown Network Programming (4) : In Depth analysis of TCP abnormal Shutdown Unknown Network Programming (5) : UDP Connectivity and Load Balancing Unknown Network Programming (6) : Understand UDP thoroughly and use it well. Unknown Network Programming (7) : How to Make UNRELIABLE UDP Reliable? “Unknown network programming (eight) : Deep decryption of HTTP from the data transfer layer” “Network programming lazy introduction (a) : a quick understanding of network communication protocol (PART I)” “Network programming lazy Introduction (two) : a quick understanding of network communication protocol (part II)” “Network programming lazy introduction (three) : A quick understanding of THE TCP protocol is enough “network programming lazy introduction (four) : a quick understanding of the difference between TCP and UDP” network programming lazy introduction (five) : a quick understanding of why UDP is sometimes more advantageous than TCP “network programming lazy introduction (six) : The history of the most popular hub, switch, router function principle introduction “network programming lazy people introduction (seven) : simple, a comprehensive understanding of the HTTP protocol” network programming lazy people introduction (eight) : Hand to teach you to write based on TCP Socket long connection “network programming lazy people introduction (nine) : Why use MAC addresses when you have IP addresses? “Technology Literacy: A New Generation of UdP-based Low Latency Network Transport Layer Protocol — A Full explanation” “Make The Internet Faster: A New generation of QUIC Protocol in Tencent’s technical Practice Sharing” “Modern Mobile terminal Network Short Connection optimization means summary: Request speed and weak network security, and adapt the talk iOS network programming in the long connection of those things, the mobile terminal IM developers required (a) : easy to understand, understand mobile network “weak” and “slow”, “the mobile end (2) : IM developers are required to read the history of most comprehensive mobile weak network optimization method summary” the IPv6 technology is a: Basic Concepts, Application Status, Technical Practice (Part 1) IPv6 Technical Details: Basic Concepts, Application Status, technical Practice (Part 2) from HTTP/0.9 to HTTP/2: An introduction to Brain-disabled Network programming (Part 1) : Introduction to Brain-disabled Network programming (2) : What are we reading and writing when we read and write sockets? Introduction to Network Programming (4) : A quick understanding of HTTP/2 Server Push “Introduction to Network programming (5) : Ping command used every day, What is it? Introduction to Network Programming (6) : What is public IP and internal IP? What is NAT? Take the Design of Network Access Layer of Online Game Server as an example to understand the Technical challenges of real-time Communication to The Advanced level: The Network Foundation that Excellent Android Programmers must Know and must Know Fully Understand the Miscellaneous Diseases of DNS domain name hijacking on Mobile Terminal: Technical Principles, Root Causes and Solutions “Android Developers must know and must know network communication Transport Layer protocols — UDP and TCP” “IM Developers must know zero Basic Communication Technology introduction (1) : 100 Years of Development of Communication switching Technology (1)” “IM developers zero Basic Communication technology introduction (2) : Communication history of the exchange of technology in one hundred (under) the IM introduction to developers of zero based communication technology (3) : Chinese communication mode of “one hundred change” IM developers based communication technology introduction of zero (4) : the evolution of mobile phones, most complete history of the evolution of mobile terminals. The IM developers based communication technology introduction of zero (5) : 1 g to 5 g, 30 years of mobile communication technology evolution, the IM developers based communication technology introduction of zero (6) : mobile terminal joint – “base station” technology “” IM developers based communication technology introduction of zero (7) : mobile terminal” electromagnetic waves “– a swift horse of the IM developers based communication technology introduction of zero (8) : Zero basis, the strongest in the history of “principle of the antenna,” literacy “” IM developers based communication technology introduction of zero (9) : wireless communication network center, the core network” the IM developers based communication technology introduction of zero (10) : zero foundation, the strongest in the history of 5 g technology literacy “” IM developers based communication technology introduction of zero (11) : Why is WiFi signal bad? Introduction to Basic Communication technology for IM Developers (12) : Access to network traffic? Network Down? Get it! Introduction to Zero-Base Communication technology for IM Developers (13) : Why Cell phone Signal is Bad? How hard is wireless Internet access on high-speed Trains? “Introduction to Zero-base COMMUNICATION Technology for IM Developers (15) : Understanding positioning technology, one is enough” baidu APP mobile terminal network in-depth optimization practice sharing (1) : DNS optimization “baidu APP mobile terminal network in-depth optimization practice sharing (2) : Network connection optimization “Baidu APP mobile terminal network depth optimization practice sharing (three) : mobile terminal weak network optimization” “technology master Chen Shuo share: from shallow to deep, network programming learning experience dry summary” may screw up your interview: do you know how many HTTP requests can be launched on a TCP connection? “Zhihu Technology Sharing: Zhihu High Performance Long Connection Gateway Technology Practice for Ten Million Level Concurrent” >> more similar articles…… [2] NIO Asynchronous Network Programming Materials: AIO Principles and Linux AIO Introduction, 11 Questions and Answers about Netty, Open Source NIO Framework: MINA or Netty first? Netty or Mina: In-depth Study and Comparison (I) Netty or Mina: In-depth Study and Comparison (II) Introduction to NIO Framework (I) : Server based UDP bidirectional communication Demo Demo (II) Introduction to NIO Framework (II) : NIO framework introduction (3) : iOS and MINA2, Netty4 cross-platform UDP two-way communication actual combat “NIO framework introduction (4) : Android and MINA2, Netty4 cross-platform UDP two-way communication actual “Netty 4.x learning (a) : ByteBuf details” “Netty 4.x learning (b) : Channel and Pipeline details” “Netty 4.x learning (c) : Detailed description of the Thread Model Apache Mina Framework advanced Part 1: IoFilter Detailed Description Apache Mina Framework Advanced Part 2: MINA2 Thread Principle Summary (including simple test examples) Apache MINA2.0 Development Guide (Chinese version) MINA and Netty source code (online reading version) has been compiled and released “Solving the problem of TCP sticky and missing packets in MINA Data Transfer (source code available)” “Solving the problem of multiple Filter instances of the same type co-existing in MINA” “Summary of Practice: The pits encountered in netty3. x upgrade (Thread)” “Summary of Practice: Netty3. X vs. Netty4. X thread model “details Netty security: principle introduction, code demo (part 1)” “Details Netty security: principle introduction, code demo (part 2)” “Details Netty elegant exit mechanism and principle” “NIO framework details: “Twitter: How to Use Netty 4 to Reduce GC Overhead for JVM” “Absolute Dry Goods: Key Technical Points for Implementing Mass Access Push Service based on Netty” “Netty Dry Goods sharing: Summary of Production class TCP Gateway Technology Practice in Jingmai” “Beginners: By far the most thorough Netty high performance principle and framework architecture analysis “for beginners: Java high performance NIO framework Netty learning methods and advanced strategies” “less verbose! One minute take you to understand the difference between Java NIO and classic IO” “history of the strongest Java NIO entry: For those worried about getting started and giving up, please read this! Netty network communication procedures for you to achieve the heartbeat mechanism, disconnection mechanism “>> more similar articles…

(This article is simultaneously published at: www.52im.net/thread-2737…)