1. Terminology

  • Unicast: the server sends a message to a single client
  • Multicast: the server sends a message to multiple clients
  • Multicast/broadcast: the server sends a message to a group of clients; a group ID identifies this group of users
  • Uplink message: a message sent from the client to the server
  • Downlink message: a message sent from the server to the client

2. System architecture

  • Proxy: deployed in edge equipment rooms; clients are routed to a Proxy via intelligent DNS
  • LogicService: handles authentication, heartbeats, online/offline status, and joining/leaving groups
  • PushService: handles unicast and broadcast; it receives messages and forwards them to the Proxy (Comet), which pushes them down to the clients
  • ImService: chat server; handles one-to-one chat, group chat, and offline messages
  • ConsumerService: performs asynchronous write diffusion of group messages
  • AuthService: authentication service

Data structures

CacheService maintains the global online-user table, a two-level map: user_id -> conn_id -> server_id.

  • user_id: defined by the business service; uniquely identifies a user
  • conn_id: assigned by the storage process; uniquely identifies one of the user's connections
  • server_id: identifies the access process (Proxy) to which the connection belongs

Each Proxy maintains its own online-user table: (user_id, conn_id) -> Connection

  • A Connection wraps a client connection; messages are pushed to that client through it

Each Proxy also maintains room membership information: room_id -> ConnectionList
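
A minimal Go sketch of these tables, for illustration only (the type names, field layout, and locking are assumptions, not the project's actual code):

```go
package state

import "sync"

// Connection wraps one client connection held by a Proxy; messages are
// pushed to the client through it (socket, send queue, etc. omitted).
type Connection struct {
	UserID string
	ConnID string
}

// OnlineTable is the CacheService-side global online-user table:
// user_id -> conn_id -> server_id (a two-level map).
type OnlineTable struct {
	mu    sync.RWMutex
	users map[string]map[string]string
}

func NewOnlineTable() *OnlineTable {
	return &OnlineTable{users: make(map[string]map[string]string)}
}

// Add records that the user's connection conn_id is owned by the access
// process server_id.
func (t *OnlineTable) Add(userID, connID, serverID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.users[userID] == nil {
		t.users[userID] = make(map[string]string)
	}
	t.users[userID][connID] = serverID
}

// ProxyState is what each Proxy keeps locally: its own connections and the
// room membership used for room broadcasts.
type ProxyState struct {
	mu    sync.RWMutex
	conns map[string]map[string]*Connection // user_id -> conn_id -> *Connection
	rooms map[string][]*Connection          // room_id -> ConnectionList
}
```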

3. Message model

3.1 Read diffusion model

  • Advantage: each message is written only once, which greatly reduces writes, especially for group chat
  • Disadvantage: the message-synchronization logic is complex, and the receiver has to pull messages once per session, which generates many wasted requests

3.2 Write diffusion model

  • Advantage: the logic for pulling messages is simple
  • Disadvantage: write amplification; a one-to-one chat needs two extra writes and a group chat needs N writes (a sketch of both models follows)
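
A minimal sketch of the two models, assuming a per-user inbox for write diffusion and a per-conversation timeline for read diffusion (both store layouts are illustrative assumptions):

```go
package msgmodel

// Write diffusion: the message is appended to every recipient's personal
// inbox (N writes for N members), so each client only pulls its own inbox.
func writeDiffusion(inbox map[string][]string, members []string, msg string) {
	for _, m := range members {
		inbox[m] = append(inbox[m], msg)
	}
}

// Read diffusion: the message is written once to the conversation's
// timeline; each reader must then pull every conversation it belongs to.
func readDiffusion(timeline map[string][]string, convID, msg string) {
	timeline[convID] = append(timeline[convID], msg)
}
```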

4. Implementation method

4.1 Single chat

4.1.1 Design objectives

4.1.2 Online messages

4.1.3 Offline messages

4.1.4 Detecting message loss

4.2 Group chat

4.2.1 Design objectives

4.2.2 Small groups (write diffusion)

4.2.3 Large groups (read diffusion)

5. High-performance analysis

Resource bottlenecks, in order: CPU > bandwidth > memory.

5.1 Capacity planning (Alibaba Cloud hosts, 16 cores / 32 GB / 2.5 GHz, 50% reserve)

  • 10,000 connections per Proxy
  • 100 Proxy instances
  • 50 logicService/cacheService/pushService instances
  • Or, as an improved layout:
    • 10 logicService
    • 5 pushService
    • Kafka cluster
    • ZooKeeper cluster
    • 10 cacheService
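
With these figures, 100 Proxy instances at 10,000 connections each give roughly 100 × 10,000 = 1,000,000 concurrent connections, with the 50% reserve kept as headroom on each host.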

5.2 Paths with no internal communication bottleneck (all of the following can be scaled horizontally):

  • RPC initiated by the client: Mobile -> Proxy -> Micro
  • Online/offline/room switch/heartbeat: Mobile -> Proxy -> logicService -> cacheService
  • Unicast: Micro -> logicService (-> cacheService) -> pushService -> Proxy -> Mobile (see the sketch after this list)
  • Online-status queries
    • Query a user's online status in a room/session
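
A rough sketch of the unicast path referenced above; the lookup and RPC helpers are hypothetical stubs, and only the conn_id -> server_id lookup mirrors the data structures described earlier:

```go
package push

import "fmt"

// lookupUser stands in for the cacheService query that returns a user's
// entry in the global table: conn_id -> server_id (hypothetical stub).
func lookupUser(userID string) map[string]string {
	return map[string]string{"conn-1": "proxy-3"}
}

// forwardToProxy stands in for the pushService -> Proxy RPC; the Proxy
// then writes the payload to the matching Connection.
func forwardToProxy(serverID, connID string, payload []byte) {
	fmt.Printf("push %d bytes to conn %s via %s\n", len(payload), connID, serverID)
}

// pushUnicast fans one message out to every connection of a single user.
func pushUnicast(userID string, payload []byte) {
	for connID, serverID := range lookupUser(userID) {
		forwardToProxy(serverID, connID, payload)
	}
}
```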

5.3 Paths with possible internal communication bottlenecks:

  • Batch unicast: Micro -> logicService (N parallel -> Router) -> pushService -> Proxy -> Mobile
    • Limitation: the total number of users in one batch must be kept small
  • Broadcast: Micro -> logicService -> pushService -> Proxy -> Mobile
    • Limitation: the number of pushService instances cannot be too large, because each pushService periodically pulls the room list from every Proxy
    • Improvement: decouple logicService and pushService with Kafka (see the sketch after this list); since pushService has the lowest CPU consumption among proxy/logicService/cacheService/pushService, only a few pushService instances are needed
  • Online-status queries
    • Total online count (/count): logicService periodically aggregates per-room user counts into cacheService, and only a limited number of logicService instances run this periodic query
    • Per-room queries (/room and /count): query the users in a room
    • Traversal interface (/list): for debugging only, not for production use
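
A minimal sketch of the Kafka decoupling mentioned above, assuming the segmentio/kafka-go client and an illustrative topic name (the document only states that logicService and pushService are decoupled through Kafka):

```go
package broadcast

import (
	"context"

	"github.com/segmentio/kafka-go"
)

// logicService side: publish a broadcast once to Kafka instead of calling
// every pushService instance directly.
// w is e.g. &kafka.Writer{Addr: kafka.TCP("kafka:9092"), Topic: "im-broadcast"}.
func publishBroadcast(ctx context.Context, w *kafka.Writer, roomID string, payload []byte) error {
	return w.WriteMessages(ctx, kafka.Message{
		Key:   []byte(roomID), // keeps one room's messages on one partition
		Value: payload,
	})
}

// pushService side: consume broadcasts from the topic and hand each one to
// the push callback, which forwards it to the Proxies this instance serves.
// Whether every instance uses its own consumer group (so all see every
// broadcast) or they share one group is a deployment choice; a shared
// groupID is shown here.
func consumeBroadcasts(ctx context.Context, brokers []string, groupID string, push func(roomID string, payload []byte)) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: brokers,
		GroupID: groupID,
		Topic:   "im-broadcast", // topic name is an assumption
	})
	defer r.Close()
	for {
		m, err := r.ReadMessage(ctx)
		if err != nil {
			return err
		}
		push(string(m.Key), m.Value)
	}
}
```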

5.4 Proxy performance bottlenecks

5.5 RPC performance bottlenecks

6. High availability analysis

The goal is to provide users with 7×24 uninterrupted service. Iterative development requires that internal modules and business services can be upgraded and scaled without users noticing.

  • Proxy: stateless service. When a Proxy restarts or is upgraded, the client detects the dropped connection and automatically reconnects to another Proxy
  • LogicService: stateless service. During a restart or upgrade, the Proxy automatically switches to another logicService instance
  • PushService: stateless service. During a restart or upgrade, the remaining pushService instances continue to provide service
  • CacheService: stateful service. During a restart or upgrade, the standby cacheService takes over; after the upgrade, traffic is switched back to the primary cacheService
  • ImService: stateless service. During a restart or upgrade, the remaining ImService instances continue to provide service
  • MySQL: availability is ensured by a master-master setup
  • Redis: availability is ensured by the Sentinel mechanism

7. Handling of abnormal situations

  • How to prevent message loss: the receiver reports the maximum message ID it has received, and the server resends any missing messages (a sketch follows this list)
  • A Redis primary/secondary failover can make auto-increment IDs non-contiguous
  • How to improve Proxy broadcast performance
  • How to avoid the bottleneck of a single RPC connection
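
A minimal sketch of the ack-based resend described in the first bullet, assuming per-session, monotonically increasing message IDs (storage and transport are placeholders):

```go
package reliability

// Message IDs within one session are assumed to increase monotonically.
type Message struct {
	ID      int64
	Payload []byte
}

// resendMissing is called when the client reports the largest message ID it
// has received; any stored message with a higher ID is resent.
func resendMissing(stored []Message, maxReceived int64, send func(Message)) {
	for _, m := range stored {
		if m.ID > maxReceived {
			send(m)
		}
	}
}
```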

8. Low cost and security

  • Almost no external dependencies, so operation and maintenance costs are very low
  • A high-performance implementation reduces server costs
  • Integrated authentication and authorization, with HTTPS support
