IM development technology learning: reveal the system design behind the information push of WeChat Moments

This article was originally published by Xu Ning in the Tencent Lecture Hall under the title "How do programmers push the content you care about to you? Revealing the system design behind information flow recommendation". It has been edited and revised for this version.

1. Introduction

Information push (hereinafter the "feed stream") is almost ubiquitous in mobile apps, especially in social and community products. The most commonly used examples are WeChat Moments and Sina Weibo.

A feed stream can be understood simply: as long as your thumb keeps swiping down the screen, new pieces of information keep appearing. It is like feeding livestock: as long as the animal finishes what is in the trough, more keeps being added, hence the name "feed".

Most products with a feed stream feature include two kinds of feed stream:

1) One is algorithm-based: dynamic algorithmic recommendation, such as Toutiao and Douyin short video;
2) One is follow-based: social/friend relationships, such as WeChat and Zhihu.

For example, in Weibo and Zhihu in the figure below, the tabs in the top bar include both a "Follow" tab and a "Recommend" tab:

The two feed streams shown in the figure above rely on quite different technology. Unlike the "Recommend" tab, whose content is chosen by a personalization algorithm (a different feed for every user), the content shown on the "Follow" tab usually has a fixed order. The most common rule is to sort by timeline: the posts, updates, and statuses of the people I follow are arranged by posting time, from newest to oldest.

This article focuses on the technical implementation of the "Follow" feed. It first summarizes the common back-end implementation schemes for timeline-based feed streams, and then, using a concrete business scenario, shows how the basic design ideas can be adapted flexibly to actual needs.

(This article was published simultaneously at: http://www.52im.net/thread-3675-1-1.html)

2. The author of this article

Xu Ning: Tencent application development engineer, lecturer at Tencent College, graduated from Shanghai Jiaotong University. He is currently responsible for the back-end development of Tencent's smart retail business, and has extensive experience in live video and automated marketing system development.

3. Feed stream technology realization scheme 1: read diffusion

Read diffusion, also known as "pull mode", is probably the implementation that best matches our intuition.

The principle is as follows:

As shown in the figure above: each content publisher has its own outbox ("content I published"), and every new post is stored in the publisher's own outbox. When a fan comes to read, the system first fetches everyone that fan follows, then traverses the outboxes of all those publishers, takes out the posts they have published, sorts them by publication time, and shows them to the reader.

With this design, one read of the feed stream by the reader spreads into N read operations in the background (N is the number of people the reader follows) plus one aggregation operation, hence the name read diffusion. Each read of the feed stream is equivalent to going to the outbox of every followed publisher and actively pulling the posts, hence the name pull mode.
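The pull flow described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the in-memory dictionaries `outbox` and `following`, and the function names `publish` and `read_feed`, are hypothetical stand-ins for whatever storage and API a real system uses.

```python
from collections import defaultdict

# Hypothetical in-memory stores; a real system would use a database or cache.
outbox = defaultdict(list)     # publisher_id -> list of (timestamp, post)
following = defaultdict(set)   # reader_id -> set of publisher_ids


def publish(publisher_id, timestamp, post):
    """Publishing is a single write into the publisher's own outbox."""
    outbox[publisher_id].append((timestamp, post))


def read_feed(reader_id, limit=10):
    """One feed read spreads into N outbox reads plus one aggregation."""
    merged = []
    for publisher_id in following[reader_id]:   # N reads, one per followed publisher
        merged.extend(outbox[publisher_id])
    merged.sort(key=lambda item: item[0], reverse=True)   # newest first
    return [post for _, post in merged[:limit]]


following["reader"] = {"alice", "bob"}
publish("alice", 1, "alice: hello")
publish("bob", 2, "bob: hi")
publish("alice", 3, "alice: again")
feed = read_feed("reader")
```

Note that the write path is trivial, while the cost of `read_feed` grows with the number of publishers the reader follows.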

This model:

1) The advantage is: the underlying storage is simple, and there is no waste of space;
2) The disadvantage is: each read operation will be very heavy and there will be a lot of operations.

Imagine that I follow a very large number of people: traversing everyone I follow and then aggregating the results makes the system overhead very large, and the latency may become intolerable.

Therefore, read diffusion is mainly suitable for scenarios where readers do not follow many people and the feed stream is not read frequently.

Pull mode has another big disadvantage: paging is inconvenient. When we browse Weibo or Moments, content is pulled from the back end page by page as the thumb keeps swiping. Without further optimization, with only real-time aggregation, scrolling down to deeper pages becomes very expensive.

4. Feed stream technology realization scheme 2: write diffusion

According to statistics, the read-write ratio of most feed stream products is about 100:1. In other words, most of the time you use the feed stream to read the Moments and Weibo posts of others, and only rarely do you post to Moments or Weibo yourself.

Therefore, the heavy read logic of read diffusion is not suitable for most scenarios.

We would rather make the posting process more complicated than hurt the user's experience of reading the feed stream, so a small modification of the previous scheme leads to write diffusion. Write diffusion is also called "push mode", and it addresses some of the shortcomings of pull mode.

The principle is as follows:

As shown in the figure above: In addition to the outbox, each user in the system will also have their own inbox. When a publisher publishes a post, in addition to recording it in his own outbox, it also traverses all the publisher's fans, and puts a copy of the same content into the inbox of these fans. In this way, when readers come to read the feed stream, they can read it directly from their inbox.

With this design, every published post spreads into M write operations (M is the number of fans), hence the name write diffusion. Each post is actively pushed to the inboxes of all fans, hence the name push mode.

As you can imagine, publishing a post involves many write operations behind the scenes. Usually, for the sake of the poster's experience, the request returns success as soon as the post has been written to the poster's own outbox. A separate asynchronous task in the background then delivers the post to the fans' inboxes at its own pace.
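A minimal Python sketch of this push flow follows. It assumes in-memory dictionaries and uses a plain `deque` named `pending` in place of a real asynchronous task queue; all names here are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical in-memory stores for illustration only.
outbox = defaultdict(list)   # publisher_id -> list of (timestamp, post)
inbox = defaultdict(list)    # reader_id -> list of (timestamp, post)
fans = defaultdict(set)      # publisher_id -> set of fan ids
pending = deque()            # stands in for an asynchronous task queue


def publish(publisher_id, timestamp, post):
    """Return as soon as the outbox write is done; delivery happens later."""
    outbox[publisher_id].append((timestamp, post))
    pending.append((publisher_id, timestamp, post))
    return "ok"


def deliver_pending():
    """Background worker: each post spreads into M inbox writes."""
    writes = 0
    while pending:
        publisher_id, timestamp, post = pending.popleft()
        for fan_id in fans[publisher_id]:
            inbox[fan_id].append((timestamp, post))
            writes += 1
    return writes


def read_feed(reader_id, limit=10):
    """Reading only touches the reader's own inbox."""
    items = sorted(inbox[reader_id], key=lambda item: item[0], reverse=True)
    return [post for _, post in items[:limit]]


fans["alice"] = {"r1", "r2", "r3"}
status = publish("alice", 1, "hello")
before = read_feed("r1")     # delivery has not run yet
writes = deliver_pending()
after = read_feed("r1")
```

The gap between `before` and `after` is the delivery delay that a slow asynchronous task would expose to fans.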

The advantage of write diffusion is that it improves the reader's experience through data redundancy (a post is stored in M copies). Normally a moderate amount of data redundancy is not a problem, but for Weibo celebrities it does not work at all. For example, Xie Na and He Jiong, currently the top 2 accounts on Weibo by fan count, have over 100 million Weibo fans.

Imagine using only the push model: every time Xie Na or He Jiong posts on Weibo, it would cause an earthquake in the Weibo back end. A single post would trigger hundreds of millions of write operations, which is obviously not feasible.

In addition, because write diffusion is asynchronous, slow writes mean that some fans still cannot see a post long after it was sent, which is a poor experience.

Write diffusion is usually suitable when the number of friends is not large. WeChat Moments, for example, uses write diffusion. Each WeChat user can have at most 5,000 friends, so posting one Moment spreads into at most 5,000 write operations. If the asynchronous tasks perform well, this is no problem at all.

Technical information about WeChat Moments:

1) "WeChat Moments Massive Technology PPT [Attachment Download]"
2) "Summary of Technical Challenges and Practices Behind the 100 Billion Visits of WeChat Moments"
3) "The Way of Architecture: 3 Programmers Achieve an Average Daily Publishing Volume of 1 billion WeChat Moments [With Video]"

5. Feed stream technology realization scheme 3: read-write mixed mode

The read-write mixed mode, also called "push-pull combination", can combine the advantages of both read diffusion and write diffusion.

Let's first summarize the advantages and disadvantages of read diffusion and write diffusion:

See the picture above: Carefully compare the advantages and disadvantages of read diffusion and write diffusion. It is not difficult to find that the application scenarios of the two are complementary.

Therefore: when designing the back-end storage, if we can distinguish the scenes, choose the most suitable scheme in different scenes, and dynamically adjust the strategy, we can realize the mixed mode of reading and writing.

The principle is as follows:

Take Weibo as an example: when someone with a very large number of fans, such as He Jiong, publishes a post, the post is written to He Jiong's outbox, and only the more active subset of his fans is selected (this already filters out most of them) to have the post written to their inboxes. When an ordinary user with few fans publishes a post, write diffusion is used: the system traverses all of that user's fans and writes the post to each fan's inbox.

Active users who log in and browse the feed stream can read posts directly from their own inbox, which guarantees the experience of active users.

When an inactive user suddenly logs in and refreshes the feed stream:

1) On the one hand, the system needs to read that user's inbox;
2) On the other hand, it needs to traverse the outboxes of the big V users (accounts with very large followings) that the user follows, extract their posts, and aggregate everything for display.

After the feed has been displayed, the system also needs a task that decides whether the user should be upgraded to an active user.

Because read diffusion is still involved, even in mixed mode the number of accounts each reader can follow must have an upper limit. For example, Sina Weibo limits each account to following at most 2,000 people.

Without an upper limit, imagine a user who follows every Weibo account: opening the follow list would mean reading every post on Weibo. With read diffusion the system would inevitably crash (and even with write diffusion, that user's inbox could not hold so many posts).

In the read-write mixed mode, the system needs to make two judgments:

1) Which users are big Vs: the number of fans can be used as the indicator;
2) Which users are active fans: the criterion can be the last login time, etc.

These two judgment standards need to be dynamically identified and adjusted during the development of the system, and there is no fixed formula.
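To make the mixed mode concrete, here is a small Python sketch under stated assumptions. The thresholds `BIG_V_FAN_THRESHOLD` and `ACTIVE_WINDOW_SECONDS`, the in-memory dictionaries, and the choice to "upgrade" a returning user by backfilling the inbox at read time are all hypothetical; the article itself only says that such thresholds must be tuned and that a separate task decides on upgrades.

```python
from collections import defaultdict

BIG_V_FAN_THRESHOLD = 3             # hypothetical; tuned in a real system
ACTIVE_WINDOW_SECONDS = 7 * 86400   # hypothetical; "logged in within 7 days"

outbox = defaultdict(list)      # publisher_id -> list of (timestamp, post)
inbox = defaultdict(list)       # reader_id -> list of (timestamp, post)
fans = defaultdict(set)         # publisher_id -> fan ids
following = defaultdict(set)    # reader_id -> publisher ids
last_login = {}                 # user_id -> timestamp of last login


def follow(reader_id, publisher_id):
    following[reader_id].add(publisher_id)
    fans[publisher_id].add(reader_id)


def is_big_v(user_id):
    return len(fans[user_id]) >= BIG_V_FAN_THRESHOLD


def is_active(user_id, now):
    return now - last_login.get(user_id, float("-inf")) <= ACTIVE_WINDOW_SECONDS


def publish(publisher_id, timestamp, post):
    outbox[publisher_id].append((timestamp, post))
    targets = fans[publisher_id]
    if is_big_v(publisher_id):
        # Big V: push only to the active subset of fans.
        targets = {f for f in targets if is_active(f, timestamp)}
    for fan_id in targets:
        inbox[fan_id].append((timestamp, post))


def read_feed(reader_id, now, limit=10):
    if not is_active(reader_id, now):
        # Inactive reader: also pull from the outboxes of followed big Vs.
        pulled = set(inbox[reader_id])
        for publisher_id in following[reader_id]:
            if is_big_v(publisher_id):
                pulled.update(outbox[publisher_id])
        inbox[reader_id] = list(pulled)   # assumption: backfill on upgrade
    last_login[reader_id] = now
    items = sorted(inbox[reader_id], key=lambda item: item[0], reverse=True)
    return [post for _, post in items[:limit]]


for fan in ("a", "b", "c"):
    follow(fan, "star")
follow("b", "joe")
last_login["a"] = 1000                 # only "a" is an active user
publish("star", 1000, "star post")     # pushed to "a" only
publish("joe", 1001, "joe post")       # pushed to all of joe's fans
feed_a = read_feed("a", 1002)
feed_b = read_feed("b", 1003)
```

Here `feed_a` is served from the inbox alone, while `feed_b` combines the inbox with a pull from the big V's outbox.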

It can be seen that the read-write mixed mode combines the advantages of the two modes and is, in principle, the best solution.

However, its shortcoming is that the mechanism is very complicated, which causes programmers a great deal of trouble. At the beginning of a project, when there are only one or two developers and the user base is small, be cautious about adopting the mixed mode in a single step, because it easily introduces bugs. Once the project has grown to the scale of Sina Weibo and a large team is dedicated to the feed stream, the read-write mixed mode becomes necessary.

6. Pagination issues in the feed stream

The previous sections have described several common design schemes for feed streams based on timelines, but the actual operation is much more troublesome than the theory.

Next, we discuss a specific pain point in feed stream solutions: paging.

Whether read diffusion or write diffusion is used, the feed stream is essentially a dynamic list whose content keeps changing over time. Traditional front-end paging uses the parameters page_size and page_num, which indicate the number of items per page and the current page number, respectively.

For a dynamic list, there will be the following problems:

As shown in the figure above: the first page is read at T1, and someone publishes a new "Content 11" at T2. If the second page is pulled at T3, the items are shifted: "Content 6" is returned on both the first page and the second page. In fact, any addition or deletion of content between two page requests causes this kind of misalignment.

To solve this problem, the paging parameters of a feed stream usually do not use page_size and page_num. Instead, they use last_id, which records the id of the last item on the previous page. When the front end reads the next page, it must pass last_id as an input parameter; the back end locates the item corresponding to last_id, reads the page_size items that follow it, and returns them to the front end, thus avoiding misalignment.
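The difference between the two paging styles can be reproduced with a short Python sketch. The feed is modeled as a newest-first list of content ids; the function names are hypothetical, and a real back end would locate last_id with an indexed query rather than a linear `list.index` scan.

```python
def page_by_number(feed, page_num, page_size):
    """Classic offset paging: position depends on the current list contents."""
    start = (page_num - 1) * page_size
    return feed[start:start + page_size]


def page_by_last_id(feed, last_id, page_size):
    """Cursor paging: continue right after the last item the client saw."""
    if last_id is None:
        return feed[:page_size]
    start = feed.index(last_id) + 1
    return feed[start:start + page_size]


feed = list(range(10, 0, -1))              # content ids 10..1, newest first
first = page_by_number(feed, 1, 5)         # T1: read the first page
feed.insert(0, 11)                         # T2: "Content 11" is published
shifted = page_by_number(feed, 2, 5)       # T3: content 6 shows up again
stable = page_by_last_id(feed, first[-1], 5)   # T3 with last_id = 6
```

With offset paging the second page starts with content 6 again; with last_id it starts with content 5.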

As shown below:

The last_id scheme has an important condition: the data of last_id itself cannot be hard deleted.

Imagine:

1) In the above figure, 5 pieces of data are returned at time T1, and last_id is content 6;
2) At T2, content 6 is deleted by the publisher;
3) When we request the second page at T3, the data corresponding to last_id cannot be found at all, so the page offset cannot be determined.

For deletion scenarios we therefore usually use soft deletion: the content is not physically removed, and a flag is set on it to indicate that it has been deleted.

Since deleted content should not be returned to the front end, in soft-delete mode the page_size items found after last_id may include deleted data, in which case there is not enough data to return to the front end.

One solution is to keep scanning further down until the page is full. Another is to agree with the front end that the number of returned items may be less than page_size, so that page_size is only a suggested value. Once that is agreed, the page_size parameter can even be omitted.
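The "keep scanning until the page is full" option can be sketched as follows. This is a minimal illustration, assuming each item is a dictionary with hypothetical `id` and `deleted` fields; the function also returns the id of the last scanned row so that the next request can resume from it even when that row is soft-deleted.

```python
def next_page(feed, last_id, page_size):
    """Return (visible ids, cursor) from a newest-first list of soft-deletable rows."""
    start = 0
    if last_id is not None:
        # Soft-deleted rows are still stored, so last_id can always be located.
        start = next(i for i, row in enumerate(feed) if row["id"] == last_id) + 1
    page = []
    cursor = last_id
    for row in feed[start:]:
        cursor = row["id"]              # the cursor advances over deleted rows too
        if not row["deleted"]:
            page.append(row["id"])
            if len(page) == page_size:
                break
    return page, cursor


feed = [{"id": i, "deleted": False} for i in range(12, 0, -1)]
page1, cursor1 = next_page(feed, None, 5)
feed[4]["deleted"] = True    # content 8 (the last_id itself) is soft-deleted
feed[6]["deleted"] = True    # content 6 is soft-deleted
page2, cursor2 = next_page(feed, cursor1, 5)
```

If the list runs out before the page is full, the function simply returns fewer items, which matches the second option of treating page_size as a suggested value.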

7. The practice of the feed stream technology solution in a live broadcast application

7.1 Demand background
This section will combine the actual business to share a very special feed stream design scheme encountered in the actual scene.

xx live broadcast is a live streaming tool. A host can schedule a live broadcast for a future time and start broadcasting and selling when that time arrives. After the live broadcast is over, the host's fans can watch the playback.

Each live broadcast therefore has three states: preview (created but not yet started), live, and playback.

As a viewer, I can follow multiple hosts, so from the fan's perspective there is also a feed stream page of live sessions.

The most special thing about this feed stream is its sorting rules:

The sorting rules of this feed stream are:

1) Across all the hosts I follow: live sessions come first, previews are in the middle, and playbacks come last;
2) Among sessions that are live: sorted by start time, from latest to earliest;
3) Among sessions in preview: sorted by expected start time, from earliest to latest;
4) Among sessions in playback: sorted by the end time of the live broadcast, from latest to earliest.
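These four rules can be written as a single sort key. The Python sketch below is only an illustration: the field names `state`, `start_time`, `expected_start_time`, and `end_time`, and the numeric timestamps, are hypothetical.

```python
LIVE, PREVIEW, PLAYBACK = 0, 1, 2   # lower rank sorts first


def sort_key(session):
    """Map a session to a tuple that sorts according to the four rules."""
    state = session["state"]
    if state == LIVE:
        return (LIVE, -session["start_time"])             # latest start first
    if state == PREVIEW:
        return (PREVIEW, session["expected_start_time"])  # earliest start first
    return (PLAYBACK, -session["end_time"])               # latest end first


sessions = [
    {"id": "A", "state": PLAYBACK, "end_time": 50},
    {"id": "B", "state": PREVIEW, "expected_start_time": 300},
    {"id": "C", "state": LIVE, "start_time": 100},
    {"id": "D", "state": PREVIEW, "expected_start_time": 200},
    {"id": "E", "state": LIVE, "start_time": 150},
    {"id": "F", "state": PLAYBACK, "end_time": 80},
]
ordered = [s["id"] for s in sorted(sessions, key=sort_key)]
```

The key makes the difficulty visible: it depends on `state`, so the position of a session changes whenever its state changes.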

7.2 Problem analysis
The most complicated point of this requirement is the "state" factor built into the feed stream content: a change of state directly changes the order of the feed stream.

In order to explain the impact on the sorting more clearly, we can use the following figure to explain in detail:

The figure above shows 5 live sessions from 4 hosts. As a viewer, when I open the page at T1, session 3 is at the top because it is live, and the rest are in the preview state, sorted by expected start time from earliest to latest. When I open the page at T2, session 5 is at the top, the remaining three sessions are in the middle in the preview state, and session 3 has ended, so it is ranked last. And so on: once all live broadcasts are over, every session ends up in the playback state.

One thing to note: suppose I open the first page at T1, leave the page untouched until T4, and then scroll down to the second page. The session identified by the last_id of the previous page, which serves as the page offset, has very likely changed state and moved somewhere else in the order. This causes serious misalignment and inconsistent live states across pages (the first page shows the state at T1, while the second page shows the state at T4).

7.3 Solution
The live broadcast system has a one-way relationship chain, similar to Weibo. Each viewer follows a small number of hosts, while each host may have a very large number of followers.

Because of the state changes, pure write diffusion is almost impossible to implement.

The reason: with write diffusion, every state change of a session (the host creates the live broadcast, the live broadcast starts, the live broadcast ends) would spread into a very large number of write operations, which is not only complicated but also introduces an unacceptable delay.

Weibo can use write diffusion because once a post is published, there are no further state changes that affect its ranking.

In our scenario, "preview" and "live" are two intermediate states, while "playback" is the final state of every live broadcast. Once a session enters playback, its state never changes again. Therefore, the "live" and "preview" states can use read diffusion, and the "playback" state can use write diffusion.

The final plan is shown in the figure below:

As shown in the figure above: the three events that affect the live broadcast state (create live broadcast, start broadcast, end live broadcast) are all processed asynchronously by listening to a queue.

We maintain a priority queue of live + preview sessions for each host:

1) Whenever we observe that a host has created a live broadcast, the session is added to the queue, with the score set to the negative of the expected start timestamp;
2) Whenever we observe that a host has started broadcasting, the score of that session in the queue is changed to the actual start time (a positive number);
3) Whenever we observe that a host has ended a live broadcast, the session information is delivered asynchronously to the playback queue of each viewer.

Here is a small trick: as mentioned above, live sessions are sorted by start time from largest to smallest, while preview sessions are sorted by start time from smallest to largest. If the score of each preview session is the negative of its start time, its ordering also becomes largest to smallest. This conversion lets live and preview sessions share one queue: preview scores are all negative and live scores are all positive, so in the final aggregation all live sessions naturally rank ahead of the previews.
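The scoring trick can be sketched in Python as follows. A plain dictionary per host stands in for whatever sorted structure the real system uses, the event handler names are hypothetical, and removing an ended session from the host queue is an assumption implied by the queue holding only live and preview sessions.

```python
from collections import defaultdict

host_queue = defaultdict(dict)       # host_id -> {session_id: score}
playback_queue = defaultdict(list)   # viewer_id -> [(end_time, session_id)]
fans = defaultdict(set)              # host_id -> viewer ids


def on_created(host_id, session_id, expected_start):
    host_queue[host_id][session_id] = -expected_start   # preview: negative score


def on_started(host_id, session_id, start_time):
    host_queue[host_id][session_id] = start_time        # live: positive score


def on_ended(host_id, session_id, end_time):
    host_queue[host_id].pop(session_id, None)           # assumption: leaves the queue
    for viewer_id in fans[host_id]:                     # write diffusion for playback
        playback_queue[viewer_id].append((end_time, session_id))


def live_and_preview(host_ids):
    """Read diffusion: merge the followed hosts' queues, highest score first."""
    merged = []
    for host_id in host_ids:
        merged.extend(host_queue[host_id].items())
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return [session_id for session_id, _ in merged]


fans["h2"] = {"v"}
on_created("h1", "s1", 200)
on_created("h1", "s2", 300)
on_created("h2", "s3", 250)
on_started("h2", "s3", 250)
order1 = live_and_preview(["h1", "h2"])
on_started("h1", "s1", 260)
order2 = live_and_preview(["h1", "h2"])
on_ended("h2", "s3", 400)
order3 = live_and_preview(["h1", "h2"])
```

A single descending sort on the score is enough to place live sessions first (latest start first) and previews after them (earliest expected start first).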

In addition, as mentioned earlier, pulling the first page at T1 and the second page at T4 makes the live states on the first and second pages inconsistent.

The solution is a snapshot: when a viewer pulls the first page of the feed stream, we build a snapshot of all live and preview sessions as of the current time, identified by a session_id. Each time the front end pulls a page, we read directly from the snapshot. Once the snapshot has been read completely, the viewer has seen all live and preview sessions, and the remaining pages are filled from the playback queue.

In this way, the front end passes a total of 4 paging parameters to our feed stream system:

Whenever session_id and last_id are empty, the user wants to read the first page, and the snapshot must be rebuilt.

There is also a follow-up question: how should the value of session_id be chosen?

The answer is:

1) If the same viewer logging in on multiple devices need not be considered, each viewer can maintain a single snapshot, that is, the system user id can be used directly as session_id;
2) If multi-device login is considered, session_id must include information about each device, so that snapshots on different devices do not affect each other;
3) If memory is not a concern, a random string can be used as session_id each time, with an expiration time long enough to let the snapshot expire naturally.
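A rough Python sketch of snapshot paging, following the third option (a random session_id), is shown below. The `snapshots` dictionary, the `pull_page` function, and its parameter list are hypothetical; expiration and the fallback to the playback queue are omitted to keep the example short.

```python
import uuid

snapshots = {}   # session_id -> frozen, ordered list of session ids


def pull_page(current_feed, session_id=None, last_id=None, page_size=2):
    """Serve pages from a snapshot taken when the first page was requested."""
    if session_id is None or last_id is None:
        # First page: freeze the current live + preview ordering.
        session_id = session_id or uuid.uuid4().hex
        snapshots[session_id] = list(current_feed)
    snapshot = snapshots[session_id]
    start = 0 if last_id is None else snapshot.index(last_id) + 1
    return session_id, snapshot[start:start + page_size]


feed_t1 = ["s3", "s1", "s2", "s4"]     # ordering when the first page is pulled
sid, page1 = pull_page(feed_t1)
feed_t4 = ["s1", "s4", "s3", "s2"]     # states changed while the page stayed open
sid_again, page2 = pull_page(feed_t4, sid, page1[-1])
naive_page2 = feed_t4[2:4]             # what offset paging on the live list returns
```

Because the second page is read from the snapshot, it continues the T1 ordering instead of repeating a session that moved after a state change.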

With this design, the heaviest computation in the system happens when the first page is fetched and the snapshot is built.

According to current production data, for viewers who follow fewer than 10 hosts (which is the majority of cases), the first page can reach a QPS of 15,000. If requests for the second and later pages are included, the overall QPS of the feed stream is higher still, which is more than enough for the current user scale. If, when the first page is pulled, we fetched only the first 10 items, returned them directly, and built the snapshot asynchronously, the QPS might be even higher; this is a possible future optimization.

8. Summary of this article

Almost all feed streams based on timelines and follow relationships cannot escape three basic design patterns:

1) Read diffusion;
2) Write diffusion;
3) Read and write mixed.

In real businesses there may be more complex scenarios, for example:

1) State transitions that affect the ranking, as in the case described in this article;
2) In Weibo and Moments, ad insertion, special follows, hot topics, and other factors that may affect the ranking of the feed stream.

These scenarios can only be adapted according to business needs.

This article has been simultaneously published on the official account of "Instant Messaging Technology Circle".
The synchronous publishing link is: http://www.52im.net/thread-3675-1-1.html
