This article was shared by LinDu and Li Guolin of the vivo Internet technology team, with substantial revisions and changes.
1 Introduction
The IM instant messaging module is an important part of a live broadcast system. A stable, fault-tolerant, flexible, high-concurrency message module is a major factor in the user experience of a live broadcast system. This article focuses on show-style live broadcasting: combining our handling of different online business issues over the past year, we upgraded and adjusted the architecture of the continuously evolving IM message module, and summarize and organize the techniques accordingly. We hope to take this opportunity to share them with everyone.
In most mainstream live broadcast systems, push-pull streaming is the most basic technical point for realizing live video services, and IM real-time messaging technology is a key technical point for realizing interaction between all users watching the live broadcast and the host.
Through the IM message module in the live broadcast system, we can complete the development of core show live broadcast functions such as public screen interaction, color barrage, network-wide gift broadcasting, private messages, and PK. "IM message" is an information bridge for "communication" between users and users, and between users and anchors. How to ensure the stability and reliability of the "information bridge" in high concurrency scenarios is an important topic in the evolution of the live broadcast system.
Study and exchange:
- Introductory article on mobile IM development: "One entry is enough for beginners: developing mobile IM from scratch"
- Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK (click here for alternate address)
(This article is simultaneously published at: http://www.52im.net/thread-3994-1-1.html )
2. Series of articles
This article is the 8th in a series:
- "Live Broadcast System Chat Technology (1): The Road to Real-time Push Technology for the Millions-Online Meipai Live Barrage System"
- "Live Broadcast System Chat Technology (2): Ali E-commerce IM Messaging Platform — Technical Practice in Group Chat and Live Streaming Scenarios"
- "Live Broadcast System Chat Technology (3): The Evolution of WeChat Live Chat Room with 15 Million Online Messages in a Single Room"
- "Live Broadcast System Chat Technology (4): Baidu Live's Real-time Messaging System Architecture Evolution Practice for Massive Users"
- "Live Broadcast System Chat Technology (5): Cross-process Rendering and Push Streaming Practice of WeChat Mini-game Live Broadcast on Android"
- "Live Broadcast System Chat Technology (6): Chat Message Distribution Technology Practice for Live Rooms with Millions Online in Real Time"
- "Live Broadcast System Chat Technology (7): Difficulties in Architecture Design of Massive Chat Messages in Live Rooms"
- "Live Broadcast System Chat Technology (8): Architecture Practice of the IM Message Module in vivo's Live Broadcast System" (* this article)
3. Technical characteristics of live news
In the live broadcast business, there are several core concepts in the message model. Let's briefly summarize them first, so that everyone has an overall understanding of the live-broadcast message model.
3.1 Entity relationships
The entities served by the message module of the live broadcast system are the anchor and the audience:
1) Anchors and viewers: to the IM system, both are ordinary users, each with a unique user ID (UserId), which is also the key identifier by which IM distributes point-to-point messages;
2) Anchor and room number: each anchor corresponds to a room ID (RoomId). Before going live, after identity verification, the anchor is bound to a unique room number; the room number is the key identifier by which the IM system distributes messages within a live room.
3.2 Classification of message types
Based on the characteristics of the live broadcast business, IM messages can be divided in many ways, for example:
1) by receiver dimension;
2) by the type of message business in the live room;
3) by message priority;
4) by message storage method, and so on.
By receiver dimension, we divide them as follows:
1) point-to-point messages (single-chat messages);
2) live room messages (group-chat messages);
3) broadcast messages (system messages).
By specific business scenario, we divide them as follows:
1) gift messages;
2) public-screen messages;
3) PK messages;
4) business notification messages.
It is essential that messages can be distributed in real time and accurately to the corresponding group or individual user terminals. Of course, a good IM message model can also give the business some new capabilities, such as:
1) counting the real-time online number of each live room;
2) capturing the events of users entering and exiting a live room;
3) counting, in real time, how long each user stays in a live room.
3.3 Message priority
IM messages in a live broadcast system have priorities; this is very important and distinguishes them from standard social chat IM products such as WeChat and QQ, where messages have no priority. For standard social IM products like WeChat, whether in a private chat or a group chat, everyone's messages are sent with essentially the same priority; no one's message ranks higher or lower, and every message simply needs to be distributed accurately and in real time to each terminal. In a live room, however, different business scenarios mean different distribution priorities.
For example: if a live room client can only render 15~20 messages per second, while a hot live room generates more than 20 messages per second, then distributing everything in real time without priority control will leave the public screen stuttering and the gift-box animations rushing by, greatly degrading the viewing experience. Therefore, messages of different business types must be given different priorities.
Another example: gift messages rank higher than public-screen messages; among messages of the same business type, large gifts rank higher than small gifts, and public-screen messages from high-level users rank higher than those from low-level or anonymous users. When distributing business messages, it is necessary to distribute them selectively and accurately according to their actual priority.
4. The message module architecture model of the live broadcast system
The message module architecture model is shown in the following figure:
As shown in the figure above, the interaction mode of messages in our message module combines push and pull. The short-polling technology used for "pull" and the long-connection technology used for "push" are introduced in detail below.
5. Short polling technology
As shown in the architecture diagram in the previous section, short polling is used in our architecture. This section describes it in detail. (For the principles of short polling, see "Quick Introduction to IM Communication Technology on Web Pages: Short Polling, Long Polling, SSE, WebSocket".)
5.1 Business model of short polling
First, let's briefly describe the timing logic and design idea of short polling:
1) the client polls the server interface every 2s, with roomId and timestamp as parameters (timestamp is 0 or null on the first request);
2) based on roomId and timestamp, the server queries the message events generated in that room after the timestamp and returns a limited number of messages — for example 10~15 (the number generated after the timestamp may be far larger, but client rendering capacity is limited and displaying too many messages hurts the user experience, so the returned count is capped) — together with the timestamp of the last returned message, which the client uses as the baseline timestamp for its next request;
3) repeating this, each terminal can refresh the latest messages of its live room every 2s as needed.
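The polling timing above can be sketched as follows. This is a minimal illustration, not the production service: `fetch_messages` stands in for the real HTTP endpoint (a hypothetical name), and the in-memory `store` stands in for Redis.

```python
import time

POLL_INTERVAL_S = 2      # client polls roughly every 2 seconds
PAGE_LIMIT = 15          # cap returned messages to what the client can render

def fetch_messages(store, room_id, timestamp, limit=PAGE_LIMIT):
    """Server side: return up to `limit` messages of `room_id` created after
    `timestamp`, plus the timestamp of the last one for the next request."""
    msgs = [m for m in store.get(room_id, []) if m["ts"] > (timestamp or 0)]
    msgs.sort(key=lambda m: m["ts"])
    page = msgs[:limit]
    next_ts = page[-1]["ts"] if page else (timestamp or 0)
    return page, next_ts

def poll_loop(store, room_id, rounds, sleep=False):
    """Client side: first request passes timestamp=0, then chains next_ts."""
    ts, received = 0, []
    for _ in range(rounds):
        page, ts = fetch_messages(store, room_id, ts)
        received.extend(page)
        if sleep:
            time.sleep(POLL_INTERVAL_S)
    return received, ts
```

Note how the returned `next_ts` is the only state the client carries between polls, which is what makes the stateless HTTP interface work.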
The overall technical logic is shown in the figure above; the specific timing can be refined further and will be explained in detail later.
5.2 Storage model of short polling
Message storage for short polling differs from that of the normal long connection, because it has no message-diffusion problem. The message storage we need must achieve the following business goals:
1) low time complexity for message insertion;
2) low complexity for message queries;
3) a compact storage structure that does not occupy too much memory or disk space;
4) the ability to persist historical messages to disk according to business needs.
Combining these four technical requirements, after team discussion we decided to use Redis's SortedSet data structure for storage.
The specific implementation idea: according to the live room's product business types, business messages are divided into four kinds — gifts, public screen, PK, and notifications. All messages of one live room are stored in four Redis SortedSet structures, with the following keys:
1) "live::roomId::gift";
2) "live::roomId::chat";
3) "live::roomId::notify";
4) "live::roomId::pk".
The score is the timestamp at which the message was actually generated, and the value is the serialized JSON string, as shown below:
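The access pattern can be sketched with a tiny in-memory stand-in for the SortedSet (score = message creation timestamp, value = serialized JSON). In production these would be real ZADD / ZRANGEBYSCORE calls against Redis; the room number and messages below are made up for illustration.

```python
import bisect
import json

class SortedSetStore:
    """Minimal in-memory stand-in for Redis SortedSet operations."""

    def __init__(self):
        self.sets = {}            # key -> sorted list of (score, member)

    def zadd(self, key, score, member):
        bisect.insort(self.sets.setdefault(key, []), (score, member))

    def zrangebyscore(self, key, min_score, max_score, limit=None):
        # Exclusive lower bound models "messages created AFTER timestamp";
        # real ZRANGEBYSCORE uses "(min" syntax for the same effect.
        items = [m for s, m in self.sets.get(key, [])
                 if min_score < s <= max_score]
        return items[:limit] if limit else items

store = SortedSetStore()
room_id = 888888                 # hypothetical room number
key = f"live::{room_id}::chat"
store.zadd(key, 1001, json.dumps({"uid": 1, "text": "hello"}))
store.zadd(key, 1002, json.dumps({"uid": 2, "text": "nice PK!"}))

# One poll: fetch messages created after timestamp 1001, capped at 15.
page = store.zrangebyscore(key, 1001, float("inf"), limit=15)
```

Insertion is O(log N) in the sorted structure and the range query maps directly onto the timestamp-based polling interface, which is why SortedSet fits this model better than List.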
When the client polls, the logic of the server query is as follows:
Many readers may ask: why not use Redis's List data structure instead? The following figure explains in detail:
Finally, let's compare the time complexity of Redis's SortedSet and Redis's List when storing live messages (as shown below).
The above are some simple design considerations for using Redis's SortedSet data structure for message storage. Later, we will also mention the points that need attention when implementing client-side polling.
5.3 Time control of short polling
The time control of short polling is very important: we need to find a good balance between the viewing-experience QoE of live audiences and server pressure.
If the polling interval is too long, the user experience degrades a lot, the live viewing experience worsens, and message playback feels choppy. If the polling frequency is too high, it puts too much pressure on the server and produces many "empty polls". A so-called "empty poll" is an invalid poll: after the previous poll returned valid messages, if no new message is generated in the live room within the interval, the next poll returns nothing.
At present, vivo live broadcast handles 1 billion+ polls per day. During evening viewing hours, the CPU load of the servers and of Redis rises, and the Dubbo provider's thread pool stays at its high-water mark. This requires horizontally scaling the servers and expanding the Redis Cluster nodes according to the real-time load of the machines and of Redis, and even routing some live rooms with extremely high heat to a designated Redis Cluster for physical isolation — "VIP" service — to ensure that the messages of different live rooms do not affect each other.
For live rooms with different audience sizes, the polling interval can also be configured:
1) for live rooms with a small audience, say fewer than 100 people, a relatively high polling frequency (around 1.5s) can be set;
2) for rooms with more than 300 but fewer than 1,000 people, about 2s is appropriate;
3) for rooms with tens of thousands of people, about 2.5s can be set.
These configurations should be delivered in real time through the configuration center, so that clients can update the polling interval in real time. The interval can then be tuned based on the actual user experience in the live room, combined with server load, to find a relatively optimal polling interval.
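The audience-size-to-interval mapping can be sketched as a small lookup table. The thresholds and intervals mirror the examples in the text; in production these values would be pushed in real time from the configuration center, and the exact tiers here are illustrative.

```python
INTERVAL_TABLE = [          # (max_audience, interval_seconds)
    (100, 1.5),             # small rooms: poll faster
    (1000, 2.0),            # mid-sized rooms
    (float("inf"), 2.5),    # rooms with 10,000+ viewers: poll slower
]

def polling_interval(audience: int) -> float:
    """Return the configured poll interval for a room of `audience` viewers."""
    for max_audience, interval in INTERVAL_TABLE:
        if audience <= max_audience:
            return interval
    return 2.0  # defensive fallback; unreachable with the table above
```

Because the table is data, not code, swapping in a new configuration from the config center only requires replacing `INTERVAL_TABLE` at runtime.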
5.4 Notes on short polling
1) The server must validate the timestamp passed by the client: this is very important. Imagine that a viewer sends the app to the background while watching, suspending the client's polling; when the user resumes watching, the timestamp the client passes may be very old or even expired, which causes a slow query when the server reads Redis. If slow queries occur in large numbers, connections between the server and Redis cannot be released quickly, the whole server slows down, a large number of polling requests time out in an instant, and service quality and QoE drop sharply.
2) The client must de-duplicate messages: in extreme cases the client may receive duplicate messages. The cause may be as follows: at some moment the client sends a request with roomId=888888&timestamp=t1; due to network instability or a server GC pause, the request is processed slowly, taking more than 2s; meanwhile the polling timer fires and the client sends another request with roomId=888888&timestamp=t1, and the server returns the same data, so the client renders the same messages twice. This also hurts the user experience, so each client needs to check for duplicate messages.
3) Massive message volumes cannot be returned and rendered in real time: imagine a very hot live room generating thousands or tens of thousands of messages per second — the storage and query approach above has a gap. Because each poll returns only 10~20 messages for the reasons given, returning all of one hot second's data would take a long time, preventing the newest messages from being returned first. Therefore, the messages returned by polling can also be selectively discarded according to message priority.
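Points 1) and 2) above can be sketched as two small guards. The 60-second look-back window and the message-id field are illustrative assumptions, not the production values.

```python
MAX_LOOKBACK_S = 60   # illustrative bound on how far back a poll may scan

def clamp_timestamp(client_ts: float, now: float) -> float:
    """Server side: if the client's timestamp is ancient (e.g. the app was
    backgrounded), only look back a bounded window instead of range-scanning
    a huge slice of the SortedSet, which would be a slow Redis query."""
    return max(client_ts, now - MAX_LOOKBACK_S)

class Deduper:
    """Client side: drop messages whose id has already been rendered,
    so a duplicated poll response is not displayed twice."""

    def __init__(self):
        self.seen = set()

    def accept(self, msg_id) -> bool:
        if msg_id in self.seen:
            return False
        self.seen.add(msg_id)
        return True
```

A real client would also bound the size of `seen` (e.g. keep only recent ids), since live-room sessions can be long.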
5.5 Advantages and disadvantages of short polling
The advantage of clients polling the server for live-room messages is obvious: message distribution is very real-time and accurate. But the disadvantage is equally obvious: server load during business peaks is very high. If all live-room messages were distributed by polling, in the long run it would be difficult for the server side to scale linearly through horizontal expansion.
6. Long connection technology
6.1 Architecture of Persistent Connections
As shown in the figure above, the overall long-connection flow for live broadcast is:
1) the mobile client first requests the long-connection server over HTTP to obtain the IP addresses for the TCP long connection; the long-connection server returns an optimal list of connectable IPs according to routing and load policies;
2) based on the returned IP list, the mobile client issues the long-connection request, and the long-connection server receives it and establishes the connection;
3) the mobile client sends authentication information, the communication is authenticated and the identity confirmed, and the long connection is finally established. The long-connection server then manages the connection, monitors heartbeats, and handles operations such as disconnection and reconnection.
The basic architecture diagram of the long-connection server cluster:
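The three steps above can be sketched with stubs. `dispatch_ips`, `tcp_connect`, and `authenticate` stand in for the real dispatch service, TCP layer, and auth exchange; all names, addresses, and return shapes are hypothetical.

```python
def dispatch_ips(region):
    """Step 1: the HTTP dispatch service returns an ordered,
    load-balanced IP list for the caller's region (fake data here)."""
    return ["10.0.0.1:443", "10.0.0.2:443"]

def tcp_connect(addr):
    """Step 2: attempt the TCP connection (always succeeds in this sketch)."""
    return {"addr": addr, "connected": True}

def authenticate(conn, token):
    """Step 3: exchange auth info over the newly opened connection."""
    conn["authed"] = bool(token)
    return conn

def establish(region, token):
    """Try each dispatched address in order until one connects and auths."""
    for addr in dispatch_ips(region):
        conn = tcp_connect(addr)
        if conn["connected"]:
            return authenticate(conn, token)
    raise ConnectionError("no reachable long-connection server")
```

The ordered-list-plus-fallback shape is the important part: if the first address is unreachable, the client walks down the dispatch list instead of going back to the HTTP dispatch service.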
As shown in the figure above, the cluster is divided into services based on regions, and terminal machines in different regions can access as needed.
6.2 Establishment and management of persistent connections
To deliver messages to users instantly, efficiently, and securely, the live client and the IM system establish an encrypted, full-duplex data channel for sending and receiving messages. When a large number of users are online, maintaining these connections and their sessions consumes a lot of memory and CPU resources.
The IM access layer keeps its functions as simple as possible: business logic is handled by downstream logic services. To prevent a release or process restart from forcing a large number of external devices to re-establish connections and hurting the user experience, the access layer provides a hot-update release scheme: infrequently changed basic logic, such as connection maintenance and account management, lives in the main program, while business logic is embedded in the program as a .so plug-in. When business logic changes, only the plug-in needs to be reloaded, which ensures that device long connections are unaffected.
6.3 Keep-alive for long connections
After a long connection is established, if the intermediate network disconnects, neither the server nor the client can perceive it, producing a false-online situation. A key problem in maintaining the "long connection" is therefore enabling it to be notified quickly when the intermediate link fails, then reconnecting to establish a new usable connection so that the long connection stays highly available. Our approach is for the IM message module to enable the TCP keepalive detection mechanism on the server side and intelligent heartbeats on the client side.
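Enabling the server-side TCP keepalive probing can be sketched with standard socket options. The `TCP_KEEPIDLE`/`TCP_KEEPINTVL`/`TCP_KEEPCNT` options are Linux-specific (hence the `hasattr` guards), the values are illustrative, and — as the note below points out — OS-level keepalive alone is usually insufficient on mobile networks, so an application-level heartbeat is still needed.

```python
import socket

def enable_keepalive(sock, idle=30, interval=10, probes=3):
    """Turn on TCP keepalive so half-open connections get detected
    and released by the kernel instead of lingering forever."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):    # seconds idle before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):     # failed probes before reset
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

A server written on Netty (as referenced later in this article) would set the equivalent `ChannelOption.SO_KEEPALIVE` plus an application-level `IdleStateHandler` for the heartbeat side.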
Using TCP's keepalive detection, the server can detect unexpected situations such as client crashes, intermediate network disconnections, and intermediate devices deleting connection-table entries due to timeout, ensuring that the server can release half-open TCP connections when accidents occur. With intelligent heartbeats enabled on the client, the client can — at very little cost in battery and network traffic — not only inform the server of its liveness and periodically refresh NAT internal/external IP mapping tables, but also automatically reconnect the long connection when the network changes.
Jack Jiang's note: in practice, under mobile networks the TCP protocol's own keepalive mechanism is of limited use. If you are interested, read these two articles: "Why does TCP-based mobile IM still need heartbeat keepalive?" and "Thoroughly Understand the KeepAlive Mechanism of the TCP Protocol Layer". For more detailed information on long-connection heartbeat mechanisms, see: "Teach You to Use Netty to Implement Heartbeat and Disconnection-Reconnection Mechanisms for Network Communication Programs" and "One Article to Understand the Network Heartbeat Packet Mechanism in Instant Messaging Applications: Role, Principle, Implementation Ideas, etc.".
- "Mobile IM Practice: Realizing the Intelligent Heartbeat Mechanism of Android WeChat"
- "Mobile IM Practice: Analysis of the Heartbeat Strategies of WhatsApp, Line and WeChat"
- "Discussion on the Design and Implementation of an Android IM Intelligent Heartbeat Algorithm (with sample code)"
- "Correctly Understand the IM Long Connection, Heartbeat and Reconnection Mechanism, and Implement It by Hand"
- "A 10,000-word Long Article: Teach You to Implement an Efficient, Adaptive IM Long-Connection Heartbeat Keep-alive Mechanism"
- "Web-side Instant Messaging Practice: How to Make Your WebSocket Disconnect and Reconnect Faster?"
7. Real-time distribution of IM messages in the live room
7.1 Overview
The overall flow of distributing messages over the IM long connection is shown in the following figure:
When integrating the three modules — the client, the IM long-connection server, and the live business server — the overall message distribution logic follows several basic principles:
1) all single-chat, group-chat, and broadcast messages are distributed by the live business server calling the IM long-connection server's interface, which delivers the messages that need distribution to each live room;
2) the business server performs the business processing for events generated in the live room according to the corresponding business type, such as crediting a gift, deducting virtual currency, or running text-safety checks on public-screen messages;
3) the client accepts signaling control from the live business server: whether a message is distributed through the long-connection channel or via HTTP short polling is controlled by the live business server, and the client shields the details of the underlying message-acquisition method; the client's upper layer accepts a unified message data format and performs message processing and rendering for the corresponding business type.
7.2 Member management and message distribution in the live room
Live-room members are the most important basic metadata of a live room. The number of users in a single room is effectively unlimited; the distribution is typically a few very large rooms (more than 300,000 simultaneously online), hundreds of medium-sized rooms, and tens of thousands of small rooms. How to manage live-room members is one of the core functions of the live system architecture. There are two common management methods:
1) Fixed shards allocated per live room: there is a mapping between users and specific shards, and users are stored in each shard relatively randomly.
The fixed-sharding algorithm is simple to implement, but for a room with few users a shard may carry very few users, while for a room with many users a shard may carry a great many; fixed sharding inherently lacks scalability.
2) Dynamic sharding: a per-shard user count is specified, and when the number of users exceeds the threshold, a new shard is added, so the number of shards grows with the number of users.
Dynamic sharding automatically creates shards according to the room's audience size: when existing shards are full, new shards are created, trying to keep each shard at its user threshold. However, the user count of existing shards changes as users enter and leave the room, so the maintenance complexity is relatively high.
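The dynamic-sharding scheme above can be sketched as follows. The capacity is illustrative, and a real system would also compact or rebalance shards as users leave — that maintenance complexity is exactly what the text warns about and is omitted here.

```python
SHARD_CAPACITY = 1000   # illustrative per-shard user threshold

class RoomShards:
    """One live room's member shards: a new shard is created
    whenever the newest shard reaches capacity."""

    def __init__(self):
        self.shards = [set()]
        self.index = {}               # user_id -> shard number

    def join(self, user_id):
        if len(self.shards[-1]) >= SHARD_CAPACITY:
            self.shards.append(set())  # room is full: open a new shard
        self.shards[-1].add(user_id)
        self.index[user_id] = len(self.shards) - 1

    def leave(self, user_id):
        shard = self.index.pop(user_id, None)
        if shard is not None:
            self.shards[shard].discard(user_id)
```

Message fan-out then iterates `self.shards`, handing each shard's member set to a delivery worker, so delivery parallelism grows with the room's audience.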
7.3 Message distribution in the live room
A live room carries various messages — entry/exit messages, text messages, gift messages, public-screen messages, and so on. Their importance varies, and each can be assigned a corresponding priority. Messages of different priorities are placed in different message queues; high-priority messages are sent to the client first, and when the backlog exceeds the limit, the oldest, lower-priority messages are discarded.
In addition, live-room messages are real-time messages: fetching historical or offline messages is not meaningful to users, so messages are stored and organized in a read-diffusion manner. When a message is sent in the live room, the corresponding delivery service is notified according to the room's member shards, and the message is then sent to every user in each shard. To deliver live-room messages to users in real time and efficiently, when a user has multiple unreceived messages, the delivery service pushes them in batches.
7.4 Message compression for persistent connections
When using the TCP long connection to distribute live messages, we also need to pay attention to the size of the message body. If at some moment the number of distributed messages is large, or the same message is multicast to many users, the egress bandwidth of the machine room at the IM connection layer becomes a bottleneck for message distribution. Therefore, effectively controlling and compressing the size of each message is a problem we must also consider. We currently optimize the related message-body structures in two ways:
1) using the protobuf data-exchange format;
2) combining and sending messages of the same type.
According to our online tests, using the protobuf data-exchange format saves an average of 43% in byte size per message, which greatly helps us save machine-room egress bandwidth. (For more on protobuf, read "Protobuf Communication Protocol Details: Code Demonstration, Detailed Principles, etc." and "Strongly Recommend Protobuf as Your Instant Messaging Application Data Transmission Format".)
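As a rough illustration of why a binary encoding shrinks messages, compare a JSON gift message with a fixed-layout binary stand-in. Note that `struct` is NOT protobuf — protobuf additionally uses field tags and varints — but the comparison shows the per-message key-name overhead that JSON carries; the message fields and values are made up.

```python
import json
import struct

gift = {"uid": 123456, "roomId": 888888, "giftId": 42, "count": 1}

# JSON repeats every field name in every message.
json_bytes = json.dumps(gift, separators=(",", ":")).encode()

# Pack the same four integers as little-endian uint32s: 16 bytes total,
# with the field layout agreed upon out-of-band (as a schema does).
packed = struct.pack("<4I", gift["uid"], gift["roomId"],
                     gift["giftId"], gift["count"])

saving = 1 - len(packed) / len(json_bytes)
```

The exact saving depends on the message shape; the article's measured 43% average came from vivo's online tests with real protobuf schemas.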
7.5 Block messages
So-called block messages are a technical solution we borrowed from other live broadcast platforms: multiple messages are combined and sent together. Instead of having the live business server call the IM long-connection server cluster to distribute each message immediately as it is generated, the main idea is to take the live room as the dimension and, every 1s or 2s, distribute the messages generated by the business system during that window at a uniform interval — 10~20 messages per second. If the business server accumulates more than 10~20 messages in one second, the excess is discarded according to message priority; if those excess messages are all high priority — for example all gift messages — they are carried over into the next message block. The benefits are:
1) reduced transmission headers: combining messages reduces redundant message-header transmission; with multiple messages sent together under a custom TCP transport protocol, the message header can be shared, further reducing the byte size;
2) prevention of message storms: the live business server can easily control the speed of message distribution and will not flood the live client with more messages than it can process;
3) improved user experience: because the message flow rate in the live room is normal and the rendering rhythm is relatively uniform, the whole live broadcast feels very smooth.
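One tick of the block-message dispatch can be sketched as follows: take everything produced in the window, keep at most a block's worth ordered by priority, carry high-priority overflow into the next block, and silently drop low-priority overflow. The thresholds and the priority encoding are illustrative.

```python
MAX_PER_BLOCK = 20     # cap matching the client's rendering capacity
HIGH_PRIORITY = 2      # e.g. gift messages; 1 = public-screen chatter

def flush_block(pending):
    """Return (block_to_send, carry_over) for one 1s/2s tick.
    `pending` is the list of messages produced during the window."""
    ordered = sorted(pending, key=lambda m: -m["prio"])  # high priority first
    block = ordered[:MAX_PER_BLOCK]
    overflow = ordered[MAX_PER_BLOCK:]
    carry = [m for m in overflow if m["prio"] >= HIGH_PRIORITY]
    return block, carry   # low-priority overflow is silently discarded
```

The caller would prepend `carry` to the next window's `pending`, so gift messages are delayed by a tick rather than lost, while excess public-screen messages are dropped.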
8. Message discarding strategy
Regardless of whether HTTP short polling or the long connection is used, messages will be discarded when a live room has a high heat value. For example, in a game live broadcast, at a particularly exciting moment, the number of public-screen comments surges instantly, and low-value gift messages sent to cheer the player's brilliant play also surge; the number of messages the server distributes per second over the IM long connection or HTTP short polling then reaches thousands or tens of thousands. A sudden surge of messages causes the following problems for the client:
1) the client receives a surge of messages over the long connection, downlink bandwidth pressure spikes, and other services may be affected (for example, the gift SVGA animation cannot be downloaded in time);
2) the client cannot process and render so many gift and public-screen messages quickly; CPU pressure spikes and audio/video processing is affected;
3) because of the message backlog, long-expired messages may be displayed, and user experience (QoE) metrics drop.
For these reasons, discarding messages is necessary. A simple example: gift messages must rank higher than public-screen messages, PK progress-bar messages must rank higher than network-wide broadcast messages, and high-value gift messages rank higher than low-value ones. Based on these business rules, in our development practice we apply the following controls:
1) Selectively discard low-priority messages: combined with specific business characteristics, classify messages of each business type into different levels and trigger flow control during message distribution.
2) Selectively discard "old" messages: two fields, creation time and sending time, are added to the message structure. Before actually calling the long-connection channel, judge whether the interval between the current time and the message's creation time is too large; if so, discard the message directly.
3) Gain messages (correction messages): in business development, design gain messages into the message model wherever possible, meaning that a later-arriving message can subsume an earlier one.
Regarding item 3), an example: if the 9:10 message says the PK score of anchor A versus anchor B is 20:10, then the PK message distributed at 9:11 carries the absolute score 22:10 — we must not distribute the increment 2:0 and hope the client accumulates the PK bar (20+2 : 10+0). Messages can be lost due to network jitter or earlier discarding, so distributing gain (correction) messages helps the business recover to the normal state.
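The gain-message idea above can be sketched in a few lines: because each PK message carries the absolute score rather than a delta, a client that missed earlier messages still converges to the correct state. The field names are illustrative.

```python
def apply_pk_message(state, msg):
    """Absolute-state (gain) messages simply overwrite local state,
    so lost predecessors do not corrupt the PK bar."""
    state["a"], state["b"] = msg["score_a"], msg["score_b"]
    return state

client = {"a": 0, "b": 0}
stream = [
    {"score_a": 20, "score_b": 10},   # 9:10 message — assume it was LOST
    {"score_a": 22, "score_b": 10},   # 9:11 message arrives
]
apply_pk_message(client, stream[1])   # only the later message is applied
```

Had the 9:11 message carried the delta `+2:+0` instead, the client above would show 2:10 — which is why delta-only messages combine badly with any discarding strategy.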
9. Write at the end
For any live broadcast system, as the business develops and the popularity of live rooms keeps rising, the problems and challenges facing the messaging system follow. Whether it is a long-connection message storm or a massive number of HTTP short-polling requests, server pressure surges sharply, and we must keep solving and optimizing. According to the business characteristics of each period, we must keep upgrading live-broadcast messaging and evolve the IM message module to guarantee message-distribution capability and support the sustainable development of the business.
The vivo live messaging module has evolved gradually, driven mainly by business growth: as business forms diversify and audiences grow, system functions increase step by step. To solve the actual performance problems encountered, we analyze code and interface performance bottlenecks one by one and then produce corresponding optimization or decoupling solutions; the message module is no exception. We hope this article brings you some inspiration for designing the IM message module of a live broadcast system.
10. References
[1] Thoroughly understand the KeepAlive mechanism of the TCP protocol layer
[2] Unplug the network cable and plug it back in, is the TCP connection still there? Understand it in one sentence!
[3] Detailed explanation of Protobuf communication protocol: code demonstration, detailed principle introduction, etc.
[4] Still using JSON? Protobuf makes data transmission cheaper and faster (principle)
[5] Why does the TCP-based mobile IM still need the heartbeat keep-alive mechanism?
[6] Mobile IM Practice: Realizing the Intelligent Heartbeat Mechanism of Android WeChat
[7] Teach you to implement a set of efficient IM long connection adaptive heartbeat keep-alive mechanism
[8] Web-side instant messaging technology inventory: short polling, Comet, Websocket, SSE
[9] Quick introduction to IM communication technology on the web page: short polling, long polling, SSE, WebSocket
[10] WeChat's new generation communication security solution: Detailed explanation of MMTLS based on TLS1.3