Design and Practice of the Offline Message Push System Architecture of Ximalaya with 100 Million Users

This article was originally written by Li Qiankun of the Ximalaya technology team under the title “Practice of Push System”. Thank you for the selfless sharing.

1. Introduction

1.1 What is offline message push

Offline messaging is a familiar requirement for IM developers. For example, the image below shows a typical offline IM notification.

1.2 Offline push on Android is really not easy

Offline message push on mobile involves only two platforms: iOS and Android. There is little to say about iOS, where APNS is the only option.

On Android, all kinds of keep-alive techniques have emerged in an endless stream in order to make offline push work. As keeping an app alive has become steadily harder, fewer and fewer of these techniques still work. If you are interested, the following articles I have collected will give you a feel for it (they are in chronological order, and the techniques grow more advanced as Android makes keep-alive harder).

“The Ultimate Summary of App Keep-Alive (1): Dual-Process Keep-Alive Practice below Android 6.0”

“The Ultimate Summary of App Keep-Alive (2): Keep-Alive Practices for Android 6.0 and above (Preventing Process Kills)”

“The Ultimate Summary of App Keep-Alive (3): Keep-Alive Practices for Android 6.0 and above (Resurrection after Being Killed)”

“Android P is Coming: The Real Nightmare of Background App Keep-Alive and Message Push”

“A Comprehensive Review of the Actual Effect of Current Android Background Keep-Alive Schemes (as of 2019)”

“2020: Is Android Background Keep-Alive Still Possible? See How I Implement It Elegantly!”

“The Most Powerful Android Keep-Alive in History: An In-depth Analysis of Tencent TIM’s Process Immortality Technology”

“Android Process Immortality Technology Ultimate Revealing: The Underlying Principle of Process Killing and How Apps Cope with Being Killed”

“From Getting Started to Giving Up on Android Keep-Alive: Guide Users to Whitelist the App (with Whitelisting Examples for 7 Phone Brands)”

These are just a few of the articles I’ve put together on this topic. Pay special attention to the last one, “From Getting Started to Giving Up on Android: How to Guide Users to Whitelist (with 7 examples of whitelists)”. Indeed, current Android versions have almost zero tolerance for apps keeping themselves alive, so almost all of the old keep-alive tricks no longer work on new versions.

Keeping the app alive yourself is no longer possible, yet offline message push still has to be done. What then? The current best practice is to integrate with the ROM-level push channels of the various phone manufacturers. I won’t go into the details here; you can read “Android P is Coming: The Real Nightmare of Backend App Survival and Message Push”.

In the era when apps kept themselves alive and built their own push channels (on Android, of course), the architecture of an offline message push system was relatively simple: each device computed a DeviceID, and the server delivered messages over its self-built channel. That was all.

But now that self-built channels are dead and we can only rely on the manufacturers’ push channels, such as Xiaomi, Huawei, Meizu, OPPO and Vivo (to name just a few mainstream ones), there are too many phone models, and each vendor’s push API and design specifications are different (don’t talk to me about the Unified Push Alliance; I’ve been waiting three years for it to deliver, see “Unified Push Alliance on the Move”). This has forced a redesign of the offline push architecture to fit the new era of push technology.

1.3 How to design reasonably

So, given the different manufacturers’ ROM-level push channels, how should we design the backend push architecture reasonably?

The offline message push system shared in this article is not designed specifically for IM products, but however different the business layer is, the general technical ideas are the same. I hope this sharing from Ximalaya brings some inspiration to those of you who are designing offline message push for a large user base.

Recommended reading: the Ximalaya technology team has also shared “Long Connection Gateway Technology Topic (5): Ximalaya’s Self-developed 100-Million-Level API Gateway Technology Practice”; read it if you are interested.

(This article is published simultaneously at: http://www.52im.net/thread-36.)

2. Technical background

First of all, let’s introduce the role of the push system in the Ximalaya app. The following figure shows a push notification from the news business.

Offline push mainly provides a way to reach users when they have not opened the app, in order to maintain the app’s presence and raise its daily active users.

At present, the main businesses that use push include:

1) Anchor going live: the company has a live-streaming business, and when an anchor starts a broadcast, a push reminder is sent to all of the anchor’s fans;
2) Album update: the platform hosts many albums, each containing a series of audio tracks. For example, a novel is an album with many chapters; when the novel gets a new chapter, an update reminder is sent to all users subscribed to the album;
3) Personalization, the news business, and so on.

If you want to send an offline push to a user, there must be a channel between the user’s device and the system.

Anyone who has done this knows that a self-built push channel requires the app to stay resident in the background (the app “keep-alive” mentioned in the introduction). However, phone manufacturers generally adopt “aggressive” background process management for power saving and other reasons, so self-built channels are of poor quality. Channels are therefore now generally maintained by “push service providers”, which means that the company’s internal push system does not deliver pushes to users directly. (See the article mentioned in the previous section, “Android P is Coming: The Real Nightmare of Backend App Survival”.)

The offline push flow in this case is as follows:

Several major domestic manufacturers (Xiaomi, Huawei, Meizu, OPPO, Vivo, etc.) have their own official push channels, but each interface is different, so some vendors, such as Xiaomi and Getui, provide integrated interfaces. When sending, the push system sends to the integrator, the integrator forwards the push to the specific manufacturer’s channel according to the device, and that channel finally delivers it to the user.

When sending a push to a device, you must specify what to send (title and message body) and which device to send it to.

We use a token to identify a device, and the meaning of a token differs between scenarios. Inside the company, a device is generally identified by UID or DEVICEID, while integrators and the different manufacturers each have their own unique “number” for a device. The push system is responsible for converting UID/DEVICEID into the integrator’s token.

3. Overall architecture design

As shown in the figure above, the push system as a whole is a queue-based streaming processing system.

The right side of the figure above is the main link. Each business party sends pushes to the push system through the push interface. The push interface writes the data to a queue, which is consumed by the conversion and filtering service. Conversion is the UID/DEVICEID-to-token conversion described above; filtering is covered in a later section. After conversion and filtering, the data is handed to the sending module and finally sent to the integrator’s interface.

When the app starts, it sends a binding request to the server and reports the binding between UID/DEVICEID and token. When a token becomes invalid because the app is uninstalled, reinstalled, and so on, the integrator notifies the push system through an HTTP callback. Each component sends its flow records via Kafka to the company’s XStream real-time stream processing cluster, which aggregates the data and writes it to MySQL, and Grafana provides a variety of reports for presentation.

4. Design of the business filtering mechanism

Each business party may send pushes to users without restraint, but the push system has to exercise restraint, so business messages must be filtered selectively.

The filtering mechanism includes the following (listed in the order in which support was added):

1) User switch: the app lets users configure a push switch. If the user turns push off, no push is sent to the user’s device;
2) Copy deduplication: a user must not receive the same copy (message text) twice, which guards against logic errors in the upstream business party’s sending code;
3) Frequency control: each business corresponds to a msg_type, and at most XX pushes of that type may be sent within XX time;
4) Silent time: no push is sent to users from XX to XX every day, so as not to disturb their rest;
5) Hierarchical management: control is applied along two dimensions, users and messages.

Regarding point 5, specifically:

1) Each msg/msg_type has a level, so that important, high-level businesses get more opportunities to send;
2) Once a user has received xx notifications in a day, non-important messages are no longer sent to that user.
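
The filtering order described above can be pictured as a chain of checks that stops at the first rejection. The following is only a minimal Java sketch under stated assumptions, not Ximalaya’s implementation: all names (PushRequest, PushFilter, PushFilterChain, ExampleFilters) are hypothetical, state is kept in memory, the frequency filter has no time window, and silent time and hierarchical management are left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical push request; field names are illustrative only. */
final class PushRequest {
    final String deviceId;
    final String msgType;
    final String text;

    PushRequest(String deviceId, String msgType, String text) {
        this.deviceId = deviceId;
        this.msgType = msgType;
        this.text = text;
    }
}

/** One filtering rule: returns true if the push may continue down the chain. */
interface PushFilter {
    boolean allow(PushRequest request);
}

/** Runs filters in insertion order and stops at the first one that rejects. */
final class PushFilterChain {
    private final List<PushFilter> filters = new ArrayList<>();

    PushFilterChain add(PushFilter filter) {
        filters.add(filter);
        return this;
    }

    boolean allow(PushRequest request) {
        for (PushFilter filter : filters) {
            if (!filter.allow(request)) {
                return false;
            }
        }
        return true;
    }
}

/** Example filters backed by in-memory state (an assumption for this sketch). */
final class ExampleFilters {
    private ExampleFilters() {}

    /** User switch: reject devices whose owner has turned push off. */
    static PushFilter userSwitch(Set<String> disabledDevices) {
        return request -> !disabledDevices.contains(request.deviceId);
    }

    /** Copy deduplication: reject a text this device has already received. */
    static PushFilter dedup() {
        Set<String> seen = new HashSet<>();
        return request -> seen.add(request.deviceId + "|" + request.text);
    }

    /** Frequency control: at most maxPerType pushes per device and msgType. */
    static PushFilter frequency(int maxPerType) {
        Map<String, Integer> counts = new HashMap<>();
        return request ->
                counts.merge(request.deviceId + "|" + request.msgType, 1, Integer::sum) <= maxPerType;
    }
}
```

Putting the cheap user-switch check first means a device with push turned off never touches the deduplication or frequency state.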

5. Multidimensional queries under database and table sharding

Most of the time, design is based on theory and experience, but in practice, there will always be a variety of specific problems.

Ximalaya now has more than 600 million users, and the push system’s device table (which records the mapping from UID/DEVICEID to token) is of a similar order of magnitude, so the device table is sharded across databases and tables, with DEVICEID as the sharding column.

In practice, however, queries by UID or token are also common, so mappings from UID and from token to DEVICEID have to be maintained as well. Because lookups by UID are frequent, the UID side table carries the same fields as the main table.
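
To make the cost of the second query dimension concrete, here is a small in-memory Java sketch of a table sharded by deviceId plus a UID side index. It is an illustration under assumptions rather than the actual schema: ShardedDeviceTable, the hash-modulo routing and the method names are all hypothetical, and the side index here stores only deviceIds, whereas the article’s UID side table duplicates the full row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory model: device rows sharded by deviceId, plus a uid side index. */
final class ShardedDeviceTable {
    // Each shard maps deviceId -> token.
    private final List<Map<String, String>> shards = new ArrayList<>();
    // Side index needed because uid is not the sharding column: uid -> deviceIds.
    private final Map<String, List<String>> uidIndex = new HashMap<>();

    ShardedDeviceTable(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new HashMap<>());
        }
    }

    /** Routing by the sharding column; floorMod keeps the result non-negative. */
    int shardOf(String deviceId) {
        return Math.floorMod(deviceId.hashCode(), shards.size());
    }

    /** Every bind has to write two places, which is where inconsistency can creep in. */
    void bind(String uid, String deviceId, String token) {
        shards.get(shardOf(deviceId)).put(deviceId, token);
        List<String> devices = uidIndex.computeIfAbsent(uid, k -> new ArrayList<>());
        if (!devices.contains(deviceId)) {
            devices.add(deviceId);
        }
    }

    /** Lookup by the sharding column touches exactly one shard. */
    String tokenByDevice(String deviceId) {
        return shards.get(shardOf(deviceId)).get(deviceId);
    }

    /** Lookup by uid goes through the side index first, then to the shards. */
    List<String> tokensByUid(String uid) {
        List<String> tokens = new ArrayList<>();
        for (String deviceId : uidIndex.getOrDefault(uid, new ArrayList<>())) {
            tokens.add(tokenByDevice(deviceId));
        }
        return tokens;
    }
}
```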

Because a global push is done once or twice a day, and there are special pushes for silent users (users who rarely use the app), there is virtually no “hot” data in storage. Caching is used, but its benefit is very limited and it takes up a lot of space.

Multiple sharded tables plus caching result in three or four copies of the data, with different logic using different copies. Inconsistencies are frequent (and pursuing consistency hurts performance), and the query code is very complex and performs poorly.

In the end, we chose to store the device data in TiDB, which greatly simplified the code while still delivering sufficient performance.

6. Timeliness of special services

6.1 Basic concepts

The push system is queue-based: “first come, first pushed”. Most businesses do not have strict real-time requirements, but live-broadcast reminders must be delivered within half an hour, and the news business is even more demanding: the faster the better.

If a huge number of “album update” pushes are already waiting in the queue when a news push arrives, the album update business will seriously delay the delivery of the news business.

6.2 Is this an isolation problem?

Initially, we thought of this as an isolation problem: of 10 consumer nodes, 3 were dedicated to time-sensitive businesses and 7 to general businesses. The queue was RabbitMQ at the time, and we modified Spring-Rabbit to support routing messages to specific nodes based on msg_type.

This scheme has the following disadvantages:

1) Some machines are always busy while others sit idle;
2) When a new business is added, the mapping between msg_type and consumer nodes has to be configured as well, so the maintenance cost is high;
3) RabbitMQ is memory-based and consumes a large amount of memory during instantaneous push peaks, which makes RabbitMQ unstable.

Later we realized that this was a priority problem: high-priority businesses/messages should be able to jump the queue, so we wrapped Kafka to support priorities, which solved the problems of the isolation scheme. Concretely, we create multiple topics, each representing one priority level; wrapping Kafka mainly means wrapping the consumer-side logic (that is, building a PriorityConsumer).

Note: for simplicity, this article uses consumer.poll(num) to mean that a consumer fetches num messages. This differs from the real Kafka API.

There are three ways to implement PriorityConsumer, which are described below.

1) Re-order after polling: Java has the in-memory PriorityQueue and PriorityBlockingQueue. The Kafka consumer consumes data normally and puts the polled data into a priority queue again.

1.1) If a bounded queue is used, once the queue is full, later messages cannot enter the queue no matter how high their priority is, and the “queue jumping” effect is lost.
1.2) If an unbounded queue is used, messages that should pile up in Kafka pile up in memory instead, and the risk of OOM is very high.

2) Pull data from the high-priority topic first: keep consuming as long as it has data, and only consume the lower-level topic when it has none. While consuming a lower-level topic, if a message is found to have arrived on a higher-level topic, switch back to consuming the high-priority messages.

This scheme is relatively complex to implement, and in push-intensive periods such as the evening peak it may cause low-priority businesses to lose their push opportunities entirely.

3) From high to low priority, pull data in a cycle:

The logic of a loop is as follows:

consumer-1.poll(topic1-num);
consumer-i.poll(topic-i-num);
consumer-max.priority.poll(topic-max.priority-num);

If topic1-num = topic-i-num = topic-max.priority-num, the scheme has no priority effect, so topic-i-num can be regarded as a weight. We set topic-high-num = 2 * topic-low-num. All topics are consumed in the same cycle, and the “queue jumping” effect is achieved through how much is consumed from each topic at one time. In the details, we also borrowed the “sliding window” strategy to optimize overall consumption performance when a topic of some priority has had no messages for a long time.
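
The weighted cycle can be sketched in Java as follows. To keep the example runnable without a broker, each topic is modelled as an in-memory queue rather than a real Kafka consumer, so this is an illustration of the weighting idea only; WeightedPriorityPoller and its methods are hypothetical names, and the sliding-window optimization mentioned above is not shown.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/**
 * Weighted polling across priority levels. Each level is an in-memory queue
 * standing in for one Kafka topic; index 0 is the highest priority.
 */
final class WeightedPriorityPoller<T> {
    private final List<Queue<T>> topics = new ArrayList<>();
    private final int[] batchSizes;

    /** batchSizes[i] is how many messages level i may contribute per cycle (its weight). */
    WeightedPriorityPoller(int[] batchSizes) {
        this.batchSizes = batchSizes.clone();
        for (int i = 0; i < batchSizes.length; i++) {
            topics.add(new ArrayDeque<>());
        }
    }

    void publish(int level, T message) {
        topics.get(level).add(message);
    }

    /** One cycle: visit levels from high to low, taking up to batchSizes[level] from each. */
    List<T> pollCycle() {
        List<T> batch = new ArrayList<>();
        for (int level = 0; level < batchSizes.length; level++) {
            Queue<T> topic = topics.get(level);
            for (int n = 0; n < batchSizes[level] && !topic.isEmpty(); n++) {
                batch.add(topic.poll());
            }
        }
        return batch;
    }
}
```

With batch sizes {4, 2}, matching the 2:1 ratio described above, a backlog of album updates published before the news messages does not block them: each cycle takes up to four news messages and up to two album messages, so the low-priority topic still makes progress.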

From this, we can see that the timeliness problem was first understood as an isolation problem, then as a priority problem, and finally turned into a weight problem.

7. Filtering mechanism storage and performance issues

In our architecture, the main factors that affect push delivery speed are TiDB queries and the filtering logic. The issues with the filtering mechanism fall into two areas: storage and performance.

Here we take the frequency control limit “at most one message per hour” for business XX as an example.

Version 1 implementation: the Redis KV structure is <deviceId_msgtype, number of pushes already sent>.

The frequency control logic is as follows:

1) When sending, run incr on the key, which increases the send count by 1;
2) If the limit is exceeded (the value returned by incr > the upper limit on the number of sends), the message is not pushed;
3) If the limit is not exceeded and incr returns 1, this is the first message sent to the deviceId within the frequency control period of the msgtype, so expire must be called on the key to set its expiry time (the frequency control period).
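
The three steps above can be sketched in Java as follows. To keep the example self-contained, Redis is replaced by a tiny in-memory stand-in with a manually advanced clock; FakeRedis and FrequencyLimiterV1 are hypothetical names, and the stand-in uses milliseconds where the real EXPIRE command takes seconds.

```java
import java.util.HashMap;
import java.util.Map;

/** Tiny in-memory stand-in for the two Redis commands used here: INCR and EXPIRE. */
final class FakeRedis {
    private final Map<String, Long> values = new HashMap<>();
    private final Map<String, Long> expiresAt = new HashMap<>();
    long nowMillis = 0; // test clock, advanced manually

    /** INCR: a missing or expired key starts from 0, then is increased by 1. */
    long incr(String key) {
        dropIfExpired(key);
        return values.merge(key, 1L, Long::sum);
    }

    /** EXPIRE (milliseconds in this sketch): the key disappears ttlMillis from now. */
    void expire(String key, long ttlMillis) {
        if (values.containsKey(key)) {
            expiresAt.put(key, nowMillis + ttlMillis);
        }
    }

    private void dropIfExpired(String key) {
        Long deadline = expiresAt.get(key);
        if (deadline != null && nowMillis >= deadline) {
            values.remove(key);
            expiresAt.remove(key);
        }
    }
}

/** Version 1 frequency control: one counter per deviceId_msgtype key. */
final class FrequencyLimiterV1 {
    private final FakeRedis redis;
    private final long limit;
    private final long periodMillis;

    FrequencyLimiterV1(FakeRedis redis, long limit, long periodMillis) {
        this.redis = redis;
        this.limit = limit;
        this.periodMillis = periodMillis;
    }

    /** Returns true if the push may be sent. */
    boolean tryAcquire(String deviceId, String msgType) {
        String key = deviceId + "_" + msgType;
        long count = redis.incr(key);        // step 1: count this send
        if (count > limit) {                 // step 2: over the limit, do not push
            return false;
        }
        if (count == 1) {                    // step 3: first send in this period
            redis.expire(key, periodMillis); // second round trip to the store
        }
        return true;
    }
}
```

Note that the first send in each period costs two calls (incr, then expire), and that every (deviceId, msgtype) pair gets its own key.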

The above scheme has the following disadvantages:

1) The company currently has 60+ push businesses and 600 million+ deviceIds, which means up to 600 million × 60 = 36 billion keys in total, occupying a huge amount of space;
2) In many cases, handling one deviceId requires two commands: incr + expire.

To address this, our solution was:

1) Replace Redis with Pika (a disk-based, Redis-compatible store), so that disk space can meet the storage requirements;
2) Ask the system architecture group to extend the Redis protocol with a new data structure, ehash.

Ehash is a modification of the Redis hash and is a two-level map <key,field,value>. Besides the key, each field can also have its own expiry time, and the expiry time can be set conditionally.

The storage structure of frequency control data changes from <deviceId_msgtype,value> to <deviceId,msgtype,value>. This way, deviceId is stored only once for multiple msgtypes, which saves space.

To save one command, incr and expire are combined into a single command, incr(key, field, expire):

1) If the field has no expiry time set, the expiry time is set;
2) If the field has not yet expired, the expiry parameter is ignored.
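
The semantics of the combined command can be modelled in memory as below. This Java sketch only illustrates the two rules just listed; it is not the actual ehash implementation in Pika, whose internals the article does not describe. EhashModel, its manually advanced clock and the millisecond TTL are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory model of the ehash idea: key -> field -> (value, per-field expiry). */
final class EhashModel {
    private static final class Entry {
        long value;
        long expiresAt;
    }

    private final Map<String, Map<String, Entry>> data = new HashMap<>();
    long nowMillis = 0; // test clock, advanced manually

    /**
     * Combined incr + expire. The TTL is applied only when the field has no
     * live expiry (it is new or has expired); otherwise ttlMillis is ignored.
     */
    long incr(String key, String field, long ttlMillis) {
        Map<String, Entry> fields = data.computeIfAbsent(key, k -> new HashMap<>());
        Entry entry = fields.get(field);
        if (entry == null || nowMillis >= entry.expiresAt) {
            entry = new Entry();                       // start a new counting period
            entry.expiresAt = nowMillis + ttlMillis;
            fields.put(field, entry);
        }
        entry.value++;
        return entry.value;
    }

    /** Number of top-level keys, i.e. how many times a deviceId is stored. */
    int keyCount() {
        return data.size();
    }
}
```

In this model a frequency check needs a single call per send, and one deviceId key holds the counters of all its msgtypes.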

Because the push system uses the incr command heavily, its workload can be regarded as write-dominated, and most scenarios also use pipelining to achieve batch writes. We asked our colleagues on the system architecture team to optimize Pika’s write performance and add a “write mode” (parameters tuned for write-heavy scenarios); QPS reached more than 100,000.

The ehash structure also plays an important role in flow records, for example <deviceId,msgId,100001002>, where 100001002 is a sample value in a data format we agreed on. Its first, middle and last parts (three digits each) respectively record the sending, receiving and clicking details of a message (msgId) for a deviceId. For example, the first three digits “100” indicate that sending failed because the message was sent during the silent period.
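
As a small illustration of this format, the Java sketch below splits a nine-digit flow record into its three three-digit parts and packs them back. FlowRecord and its field names are hypothetical, and apart from “100” in the first part, the article does not say what the individual codes mean, so the sketch treats them as plain numbers.

```java
/** Splits a nine-digit flow record into its send, receive and click parts. */
final class FlowRecord {
    final int send;    // first three digits
    final int receive; // middle three digits
    final int click;   // last three digits

    private FlowRecord(int send, int receive, int click) {
        this.send = send;
        this.receive = receive;
        this.click = click;
    }

    /** Parses a value such as 100001002 into (100, 1, 2). */
    static FlowRecord parse(long value) {
        if (value < 0 || value > 999_999_999L) {
            throw new IllegalArgumentException("expected at most nine digits: " + value);
        }
        int send = (int) (value / 1_000_000);
        int receive = (int) (value / 1_000 % 1_000);
        int click = (int) (value % 1_000);
        return new FlowRecord(send, receive, click);
    }

    /** Packs the three parts back into a single number. */
    long encode() {
        return send * 1_000_000L + receive * 1_000L + click;
    }
}
```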


This article has also been published on the official account “Instant Messaging Technology Circle”.

▲ The synchronous publishing link is: http://www.52im.net/thread-36…