How to design a Weibo feed stream

Background

Weibo, WeChat Moments, and Douyin are all typical feed stream products: the content we browse is made up of feeds published by other users.

This article attempts to analyze the design of the Weibo feed stream. If you spot any problems, please point them out.

How to design a Weibo feed stream

1. Storage design

Data storage has three main parts:

1) Feed storage

This stores the content that users publish. It must be kept permanently, so that a user's personal homepage still shows their posts no matter how much time has passed.

The simplified data structure is as follows; the table is sharded horizontally by userId:

CREATE TABLE `t_feed` (
    `feedId` bigint NOT NULL PRIMARY KEY,
    `userId` bigint NOT NULL COMMENT 'creator',
    `content` text,
    `createTime` datetime NOT NULL,
    `recordStatus` tinyint NOT NULL DEFAULT 0
) ENGINE=InnoDB;

2) Follow relationship storage

This stores the follow relationships between users. These relationships determine which feeds a user is able to see, and they also need to be stored permanently.

The simplified data structure is as follows (to be optimized):

CREATE TABLE `t_like` (
    `id` int(11) NOT NULL PRIMARY KEY,
    `userId` int(11) NOT NULL COMMENT 'follower ID',
    `likerId` int(11) NOT NULL COMMENT 'followed user ID',
    KEY `userId` (`userId`),
    KEY `likerId` (`likerId`)
) ENGINE=InnoDB;

3) Feed sync storage

This backs the feed stream display. It can be thought of as an inbox into which the people a user follows deliver their feeds.

Depending on the business scenario, only content from a recent period needs to be kept. Cold data can be archived or deleted.

The data structure is simplified as follows:

CREATE TABLE `t_inbox` (
    `id` bigint NOT NULL AUTO_INCREMENT PRIMARY KEY,
    `userId` bigint NOT NULL COMMENT 'recipient ID',
    `feedId` bigint NOT NULL COMMENT 'content ID',
    `createTime` datetime NOT NULL
) ENGINE=InnoDB;

2. Scenario characteristics

1) Far more reads than writes

Reads vastly outnumber writes, which makes this a typical read-heavy, write-light scenario.

2) Ordered display

Feeds must be sorted for display, either by timeline or by each feed's ranking score.

3. Implementing with push mode

Push mode is also called write diffusion mode. After a followed user posts content, the content is actively pushed to that user's followers and written into each follower's inbox.

1) Scheme

After a followed user posts a piece of content, the system fetches all users who follow that person, then iterates over them and inserts the content into each of their inboxes, as in the following example:

/* Insert one feed record */
insert into t_feed (`feedId`, `userId`, `content`, `createTime`) values (10001, 4, 'content', '2021-10-31 17:00:00');
/* Query the followers of user 4 */
select userId from t_like where likerId = 4;
/* Insert the feed into each follower's inbox */
insert into t_inbox (`userId`, `feedId`, `createTime`) values (1, 10001, '2021-10-31 17:00:00');
insert into t_inbox (`userId`, `feedId`, `createTime`) values (2, 10001, '2021-10-31 17:00:00');
insert into t_inbox (`userId`, `feedId`, `createTime`) values (3, 10001, '2021-10-31 17:00:00');

When the user with ID 1 views the feed stream, all of that user's rows in the inbox table are queried, as in the following example:

select feedId from t_inbox where userId = 1 ;

The data is then aggregated and sorted.
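The push flow above can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionaries `followers_of` and `inbox` are hypothetical in-memory stand-ins for the `t_like` and `t_inbox` tables, and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the t_like and t_inbox tables.
followers_of = {4: [1, 2, 3]}   # followed user ID -> follower IDs
inbox = defaultdict(list)       # follower ID -> [(createTime, feedId), ...]


def publish(author_id, feed_id, create_time):
    """Write diffusion: copy the new feed ID into every follower's inbox."""
    for follower_id in followers_of.get(author_id, []):
        inbox[follower_id].append((create_time, feed_id))


def read_feed(user_id):
    """Read only the user's own inbox, newest first."""
    return [feed_id for _, feed_id in sorted(inbox[user_id], reverse=True)]


publish(4, 10001, "2021-10-31 17:00:00")
publish(4, 10002, "2021-10-31 18:00:00")
print(read_feed(1))  # [10002, 10001]
```

The write cost is one inbox row per follower, while a read touches a single inbox. That trade-off is what the problems below are about.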

2) Existing problems

1. Poor timeliness

When a big V (a high-profile account with a huge following) is followed by a very large number of users, iterating over all followers to insert data is time-consuming, so users do not receive the content promptly.

Possible solutions:

1. Push the fan-out task into a message queue and have multiple consumer threads process it concurrently.
2. Use a database with high insert performance and a high data compression ratio.
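As a rough sketch of the message queue idea, the example below uses Python's standard `queue.Queue` as a stand-in for a real message queue and plain threads as consumers. The batch size, worker count, and all function names are assumptions chosen for illustration.

```python
import queue
import threading
from collections import defaultdict

task_queue = queue.Queue()      # stand-in for a real message queue
inbox = defaultdict(list)       # follower ID -> list of feed IDs
inbox_lock = threading.Lock()   # protects the shared in-memory inbox


def enqueue_fanout(feed_id, follower_ids, batch_size=2):
    """Producer: split the follower list into batches, one task per batch."""
    for i in range(0, len(follower_ids), batch_size):
        task_queue.put((feed_id, follower_ids[i:i + batch_size]))


def consumer():
    """Consumer: process batches until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            break
        feed_id, batch = task
        with inbox_lock:
            for follower_id in batch:
                inbox[follower_id].append(feed_id)


def run_consumers(worker_count=3):
    """Start the workers, then send one sentinel per worker and wait."""
    workers = [threading.Thread(target=consumer) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for _ in workers:
        task_queue.put(None)
    for worker in workers:
        worker.join()


enqueue_fanout(10001, [1, 2, 3, 4, 5])
run_consumers()
print(sorted(inbox))  # [1, 2, 3, 4, 5]
```

The publisher returns as soon as the tasks are enqueued. Followers still receive the post with some delay, which is why this only softens the problem.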

2. High storage costs

Every follower has to store a copy of each post from the accounts they follow. When a big V has a very large number of followers, each post produces one inserted row per follower, so the volume of inserted data grows in proportion to the follower count.

Weibo also lets users group the bloggers they follow, so data has to be inserted not only into the overall inbox but also into the inbox of each group.

Possible solutions:

Separate hot and cold data. Hot storage keeps only data from a recent period, while cold storage keeps data older than that period. Both are cleaned up periodically.
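A minimal sketch of the hot and cold split, assuming inbox rows carry a `createTime` and that "recent" means the last 7 days. The 7-day window and the function name `split_hot_cold` are assumptions for illustration, not values from the original design.

```python
from datetime import datetime, timedelta

HOT_DAYS = 7  # assumed retention window for hot storage


def split_hot_cold(rows, now, hot_days=HOT_DAYS):
    """Partition inbox rows into (hot, cold) by the age of createTime."""
    cutoff = now - timedelta(days=hot_days)
    hot = [row for row in rows if row["createTime"] >= cutoff]
    cold = [row for row in rows if row["createTime"] < cutoff]
    return hot, cold


rows = [
    {"feedId": 10001, "createTime": datetime(2021, 10, 31, 17, 0)},
    {"feedId": 9000, "createTime": datetime(2021, 9, 1, 8, 0)},
]
hot, cold = split_hot_cold(rows, now=datetime(2021, 11, 1))
print([r["feedId"] for r in hot], [r["feedId"] for r in cold])  # [10001] [9000]
```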

As the user base keeps growing, this design will eventually hit a bottleneck.

3. Data status synchronization

When a followed user deletes a post, or a blogger is taken down, the corresponding content has to be deleted from the inboxes of all followers. This is the same timeliness problem that write diffusion has.

Possible solutions:

Check the status of each post when pulling data, and filter out posts that have been deleted or removed.
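The read-time filter might look like the sketch below. It assumes that a `recordStatus` of 0 means the post is visible and any other value means it was deleted or removed; the `feed_status` lookup is a hypothetical stand-in for a query against `t_feed`.

```python
DELETED = 1  # assumed: recordStatus 0 = visible, non-zero = deleted/removed

# Hypothetical stand-in for looking up recordStatus in t_feed by feedId.
feed_status = {10001: 0, 10002: DELETED, 10003: 0}


def visible_feed_ids(inbox_feed_ids):
    """Keep only feed IDs whose current status is visible, preserving order."""
    return [fid for fid in inbox_feed_ids if feed_status.get(fid) == 0]


print(visible_feed_ids([10003, 10002, 10001]))  # [10003, 10001]
```

Inbox rows for deleted posts can then be cleaned up lazily instead of at delete time.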

The above solutions can improve efficiency to a certain extent, but cannot solve the root cause of the problem.

Summary

Push mode is only suitable when follower counts are not too large, as with WeChat Moments, where both delivery timeliness and storage cost can be kept under control.

It is not suitable for scenarios with very large follower counts, such as Weibo big Vs.

4. Implementing with pull mode

Pull mode is also called read diffusion mode. With pull mode, a user obtains data as follows:

Get the IDs of all the bloggers the user follows.

select likerId from t_like where userId = 1;

Pull content by blogger ID.

select * from t_feed where userId in (4,5,6) and recordStatus = 0;

Take all the content and sort it by timeline.
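The three pull steps can be sketched as follows. The dictionaries `following` and `feeds_by_author` are hypothetical in-memory stand-ins for `t_like` and `t_feed`, and the sketch assumes each author's posts are already stored newest first so that `heapq.merge` can combine them without a full sort.

```python
import heapq
from itertools import islice

# Hypothetical stand-ins for t_like and t_feed.
following = {1: [4, 5, 6]}   # user ID -> IDs of followed bloggers
feeds_by_author = {          # blogger ID -> [(createTime, feedId)], newest first
    4: [("2021-10-31 17:00:00", 10001), ("2021-10-30 09:00:00", 10000)],
    5: [("2021-10-31 18:30:00", 10003)],
    6: [],
}


def pull_feed(user_id, limit=20):
    """Read diffusion: merge the timelines of every followed blogger."""
    timelines = [feeds_by_author.get(a, []) for a in following.get(user_id, [])]
    merged = heapq.merge(*timelines, reverse=True)  # inputs are newest first
    return [feed_id for _, feed_id in islice(merged, limit)]


print(pull_feed(1))  # [10003, 10001, 10000]
```

Nothing is copied at publish time, but every read has to touch one timeline per followed blogger.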

This approach solves the three problems of push mode, but it introduces a performance problem of its own.

If the user follows a lot of bloggers, pulling all their content and then sorting and aggregating it takes a long time, so request latency is very high.

So how can we respond quickly at low cost?

The database alone cannot meet this requirement, so we introduce a sharded cache layer in the middle and use caching to reduce disk I/O.

The process is:

Follow list cache

Cache the IDs of all bloggers a user follows. The key is the user ID and the value is the set of blogger IDs.

Weibo content cache

The key is the blogger ID and the value is the collection of that blogger's posts. When a blogger publishes a post, its content is added to the collection.

Get the feed stream

Using the set of followed blogger IDs, all content is pulled from every cache shard, then sorted and aggregated.

If the sharded cache cluster has three primaries and three replicas, pulling all the content takes three requests. The results are then sorted in reverse chronological order and returned to the user.
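A sketch of the sharded read path, under stated assumptions: the three shards are plain dictionaries, bloggers are routed to shards by `blogger_id % 3`, and each shard answers one batched lookup. A real cache cluster would use its own routing scheme, so treat `shard_of` as a placeholder.

```python
from collections import defaultdict

SHARD_COUNT = 3  # three primaries, matching the example above

# One dict per shard: blogger ID -> [(createTime, feedId), ...]
shards = [dict() for _ in range(SHARD_COUNT)]


def shard_of(blogger_id):
    """Assumed routing rule; real clusters use their own hashing."""
    return blogger_id % SHARD_COUNT


def cache_post(blogger_id, create_time, feed_id):
    """Store a new post in the content cache of the blogger's shard."""
    shard = shards[shard_of(blogger_id)]
    shard.setdefault(blogger_id, []).append((create_time, feed_id))


def pull_feed(blogger_ids):
    """Group IDs by shard, issue one batched read per shard, then merge."""
    ids_by_shard = defaultdict(list)
    for blogger_id in blogger_ids:
        ids_by_shard[shard_of(blogger_id)].append(blogger_id)
    posts, request_count = [], 0
    for shard_index, ids in ids_by_shard.items():
        request_count += 1  # one request per shard touched
        for blogger_id in ids:
            posts.extend(shards[shard_index].get(blogger_id, []))
    posts.sort(reverse=True)  # reverse chronological order
    return [feed_id for _, feed_id in posts], request_count


cache_post(4, "2021-10-31 17:00:00", 10001)
cache_post(5, "2021-10-31 18:30:00", 10003)
cache_post(6, "2021-10-30 09:00:00", 10000)
cache_post(7, "2021-10-31 19:00:00", 10004)
print(pull_feed([4, 5, 6, 7]))  # ([10004, 10003, 10001, 10000], 3)
```

The number of requests is bounded by the shard count rather than by the number of followed bloggers, though the amount of data pulled still grows with the follow list.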

Existing problems

The system is under heavy read pressure

If a user follows 1,000 bloggers, all the content published by those 1,000 bloggers has to be pulled, sorted, and aggregated, which puts heavy pressure on the cache service and on bandwidth.

Possible solutions:

Give each cache shard one primary node and multiple replica nodes, so the cluster can be scaled horizontally to spread out read pressure and relieve bandwidth bottlenecks.

Summary

For big V users, pull mode avoids the write diffusion problem, but it brings the read pressure problems described above.

5. Summary

Comparing the strengths and weaknesses of push and pull modes, it is easy to see that:

Push mode suits scenarios with a small number of followers, such as WeChat Moments or one-on-one chat.
Pull mode suits users with a very large number of followers, such as Weibo big Vs.

So when designing for a real scenario, you can combine push and pull modes. The logic is as follows:

Set a follower-count threshold for big V status. When a user reaches the threshold, an event is triggered that tags the user as a big V.
Users below the threshold still use write diffusion, so the amount of redundant data stays manageable and there is no timeliness problem.
When a user above the threshold posts, the post is stored in the cache (hot data) instead of being fanned out by write diffusion. At read time it is sorted and aggregated together with the data in the inbox.

PS: You can also maintain a list of active followers based on user behavior. For followers on this list, posts can still be delivered by write diffusion to make sure they arrive promptly.
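The combined push and pull logic can be sketched as below. The threshold of 3 followers is deliberately tiny so the example triggers both paths, and all names (`BIG_V_THRESHOLD`, `big_v_cache`, and so on) are hypothetical. The active-follower list from the PS is not modeled here.

```python
from collections import defaultdict

BIG_V_THRESHOLD = 3  # assumed; unrealistically small for demonstration

followers_of = {4: [1, 2], 9: [1, 2, 3, 5]}  # author ID -> follower IDs
following = {1: [4, 9]}                      # user ID -> followed author IDs
inbox = defaultdict(list)        # push path: follower ID -> [(createTime, feedId)]
big_v_cache = defaultdict(list)  # pull path: big V ID -> [(createTime, feedId)]


def is_big_v(user_id):
    """A user counts as a big V once the follower count reaches the threshold."""
    return len(followers_of.get(user_id, [])) >= BIG_V_THRESHOLD


def publish(author_id, feed_id, create_time):
    """Big Vs write once to the cache; everyone else fans out to inboxes."""
    if is_big_v(author_id):
        big_v_cache[author_id].append((create_time, feed_id))
    else:
        for follower_id in followers_of.get(author_id, []):
            inbox[follower_id].append((create_time, feed_id))


def read_feed(user_id):
    """Merge the user's inbox with the cached posts of followed big Vs."""
    posts = list(inbox[user_id])
    for author_id in following.get(user_id, []):
        if is_big_v(author_id):
            posts.extend(big_v_cache[author_id])
    return [feed_id for _, feed_id in sorted(posts, reverse=True)]


publish(4, 10001, "2021-10-31 17:00:00")  # ordinary user: pushed to inboxes
publish(9, 10002, "2021-10-31 18:00:00")  # big V: cached, pulled at read time
print(read_feed(1))  # [10002, 10001]
```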
