1. Terminology

  • Unicast: the server sends a message to a single client
  • Multicast: the server sends a message to multiple clients
  • Multicast/broadcast: the server sends a message to a group of clients; a group ID identifies this group of users
  • Uplink message: a message sent from the client to the server
  • Downlink message: a message sent from the server to the client

2. System architecture

  • Proxy: deployed in edge equipment rooms; clients are routed to a Proxy via intelligent DNS
  • LogicService: handles authentication, heartbeats, online/offline status, and joining/leaving groups
  • PushService: handles unicast and broadcast; it receives messages and forwards them to the Proxy (Comet), which pushes them down to the clients
  • ImService: chat server; handles one-to-one chat, group chat, and offline messages
  • ConsumerService: performs asynchronous write diffusion of group messages
  • AuthService: authentication service

Data structures

CacheService maintains the global online-user table, a two-level map: user_id -> conn_id -> server_id.

  • user_id: defined by the business service; uniquely identifies a user
  • conn_id: assigned by the storage process; uniquely identifies one of the user's connections
  • server_id: identifies the access process (Proxy) to which the connection belongs

Each Proxy maintains its own online-user table: (user_id, conn_id) -> Connection

  • A Connection wraps a client connection; messages are pushed to that client through it

Each Proxy also maintains room membership information: room_id -> ConnectionList
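
A minimal Go sketch of these tables, for illustration only (the type names, field layout, and locking are assumptions, not the project's actual code):

```go
package state

import "sync"

// Connection wraps one client connection held by a Proxy; messages are
// pushed to the client through it (socket, send queue, etc. omitted).
type Connection struct {
	UserID string
	ConnID string
}

// OnlineTable is the CacheService-side global online-user table:
// user_id -> conn_id -> server_id (a two-level map).
type OnlineTable struct {
	mu    sync.RWMutex
	users map[string]map[string]string
}

func NewOnlineTable() *OnlineTable {
	return &OnlineTable{users: make(map[string]map[string]string)}
}

// Add records that the user's connection conn_id is owned by the access
// process server_id.
func (t *OnlineTable) Add(userID, connID, serverID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.users[userID] == nil {
		t.users[userID] = make(map[string]string)
	}
	t.users[userID][connID] = serverID
}

// ProxyState is what each Proxy keeps locally: its own connections and the
// room membership used for room broadcasts.
type ProxyState struct {
	mu    sync.RWMutex
	conns map[string]map[string]*Connection // user_id -> conn_id -> *Connection
	rooms map[string][]*Connection          // room_id -> ConnectionList
}
```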

3. Message model

3.1 Read diffusion model

  • Advantage: each message is written only once, which greatly reduces writes, especially for group chat
  • Disadvantage: the message-synchronization logic is complex, and the receiver has to pull messages once per session, which generates many wasted requests

3.2 Write diffusion model

  • Advantage: the logic for pulling messages is simple
  • Disadvantage: write amplification; a one-to-one chat needs two extra writes and a group chat needs N writes (a sketch of both models follows)
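
A minimal sketch of the two models, assuming a per-user inbox for write diffusion and a per-conversation timeline for read diffusion (both store layouts are illustrative assumptions):

```go
package msgmodel

// Write diffusion: the message is appended to every recipient's personal
// inbox (N writes for N members), so each client only pulls its own inbox.
func writeDiffusion(inbox map[string][]string, members []string, msg string) {
	for _, m := range members {
		inbox[m] = append(inbox[m], msg)
	}
}

// Read diffusion: the message is written once to the conversation's
// timeline; each reader must then pull every conversation it belongs to.
func readDiffusion(timeline map[string][]string, convID, msg string) {
	timeline[convID] = append(timeline[convID], msg)
}
```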

4. Implementation method

4.1 Single chat

4.1.1 Design objectives

4.1.2 Online messages

4.1.3 Offline messages

4.1.4 Detecting message loss

4.2 Group chat

4.2.1 Design objectives

4.2.2 Small groups (write diffusion)

4.2.3 Large groups (read diffusion)

5. High-performance analysis

Resource bottlenecks, in order: CPU > bandwidth > memory.

5.1 Capacity planning (Alibaba Cloud hosts, 16 cores / 32 GB / 2.5 GHz, 50% reserve)

  • 10,000 connections per Proxy
  • 100 Proxy instances
  • 50 logicService/cacheService/pushService instances
  • Or, as an improved layout:
    • 10 logicService
    • 5 pushService
    • Kafka cluster
    • ZooKeeper cluster
    • 10 cacheService
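
With these figures, 100 Proxy instances at 10,000 connections each give roughly 100 × 10,000 = 1,000,000 concurrent connections, with the 50% reserve kept as headroom on each host.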

5.2 Paths with no internal communication bottleneck (all of the following can be scaled horizontally):

  • RPC initiated by the client: Mobile -> Proxy -> Micro
  • Online/offline/room switch/heartbeat: Mobile -> Proxy -> logicService -> cacheService
  • Unicast: Micro -> logicService (-> cacheService) -> pushService -> Proxy -> Mobile (see the sketch after this list)
  • Online-status queries
    • Query a user's online status in a room/session
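
A rough sketch of the unicast path referenced above; the lookup and RPC helpers are hypothetical stubs, and only the conn_id -> server_id lookup mirrors the data structures described earlier:

```go
package push

import "fmt"

// lookupUser stands in for the cacheService query that returns a user's
// entry in the global table: conn_id -> server_id (hypothetical stub).
func lookupUser(userID string) map[string]string {
	return map[string]string{"conn-1": "proxy-3"}
}

// forwardToProxy stands in for the pushService -> Proxy RPC; the Proxy
// then writes the payload to the matching Connection.
func forwardToProxy(serverID, connID string, payload []byte) {
	fmt.Printf("push %d bytes to conn %s via %s\n", len(payload), connID, serverID)
}

// pushUnicast fans one message out to every connection of a single user.
func pushUnicast(userID string, payload []byte) {
	for connID, serverID := range lookupUser(userID) {
		forwardToProxy(serverID, connID, payload)
	}
}
```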

5.3 Paths with possible internal communication bottlenecks:

  • Batch unicast: Micro -> logicService (N parallel -> Router) -> pushService -> Proxy -> Mobile
    • Limitation: the total number of users in one batch must be kept small
  • Broadcast: Micro -> logicService -> pushService -> Proxy -> Mobile
    • Limitation: the number of pushService instances cannot be too large, because each pushService periodically pulls the room list from every Proxy
    • Improvement: decouple logicService and pushService with Kafka (see the sketch after this list); since pushService has the lowest CPU consumption among proxy/logicService/cacheService/pushService, only a few pushService instances are needed
  • Online-status queries
    • Total online count (/count): logicService periodically aggregates per-room user counts into cacheService, and only a limited number of logicService instances run this periodic query
    • Per-room queries (/room and /count): query the users in a room
    • Traversal interface (/list): for debugging only, not for production use
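
A minimal sketch of the Kafka decoupling mentioned above, assuming the segmentio/kafka-go client and an illustrative topic name (the document only states that logicService and pushService are decoupled through Kafka):

```go
package broadcast

import (
	"context"

	"github.com/segmentio/kafka-go"
)

// logicService side: publish a broadcast once to Kafka instead of calling
// every pushService instance directly.
// w is e.g. &kafka.Writer{Addr: kafka.TCP("kafka:9092"), Topic: "im-broadcast"}.
func publishBroadcast(ctx context.Context, w *kafka.Writer, roomID string, payload []byte) error {
	return w.WriteMessages(ctx, kafka.Message{
		Key:   []byte(roomID), // keeps one room's messages on one partition
		Value: payload,
	})
}

// pushService side: consume broadcasts from the topic and hand each one to
// the push callback, which forwards it to the Proxies this instance serves.
// Whether every instance uses its own consumer group (so all see every
// broadcast) or they share one group is a deployment choice; a shared
// groupID is shown here.
func consumeBroadcasts(ctx context.Context, brokers []string, groupID string, push func(roomID string, payload []byte)) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: brokers,
		GroupID: groupID,
		Topic:   "im-broadcast", // topic name is an assumption
	})
	defer r.Close()
	for {
		m, err := r.ReadMessage(ctx)
		if err != nil {
			return err
		}
		push(string(m.Key), m.Value)
	}
}
```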

5.4 Proxy performance bottlenecks

5.5 RPC performance bottlenecks

6. High availability analysis

The goal is to provide users with 7×24 uninterrupted service. Iterative development requires that internal modules and business services can be upgraded and scaled without users noticing.

  • Proxy: stateless service. When a Proxy restarts or is upgraded, the client detects the dropped connection and automatically reconnects to another Proxy
  • LogicService: stateless service. During a restart or upgrade, the Proxy automatically switches to another logicService instance
  • PushService: stateless service. During a restart or upgrade, the remaining pushService instances continue to provide service
  • CacheService: stateful service. During a restart or upgrade, the standby cacheService takes over; after the upgrade, traffic is switched back to the primary cacheService
  • ImService: stateless service. During a restart or upgrade, the remaining ImService instances continue to provide service
  • MySQL: availability is ensured by a master-master setup
  • Redis: availability is ensured by the Sentinel mechanism

7. Handling of abnormal situations

  • How to prevent message loss: the receiver reports the maximum message ID it has received, and the server resends any missing messages (a sketch follows this list)
  • A Redis primary/secondary failover can make auto-increment IDs non-contiguous
  • How to improve Proxy broadcast performance
  • How to avoid the bottleneck of a single RPC connection
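
A minimal sketch of the ack-based resend described in the first bullet, assuming per-session, monotonically increasing message IDs (storage and transport are placeholders):

```go
package reliability

// Message IDs within one session are assumed to increase monotonically.
type Message struct {
	ID      int64
	Payload []byte
}

// resendMissing is called when the client reports the largest message ID it
// has received; any stored message with a higher ID is resent.
func resendMissing(stored []Message, maxReceived int64, send func(Message)) {
	for _, m := range stored {
		if m.ID > maxReceived {
			send(m)
		}
	}
}
```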

8. Low cost and security

  • Almost no external dependencies, so operation and maintenance costs are very low
  • A high-performance implementation reduces server costs
  • Integrated authentication and authorization, with HTTPS support
