
1. Background

The instant messaging (IM) system is a core part of a live streaming platform. A stable, fault-tolerant, flexible messaging module that can handle high concurrency is a major factor in the user experience of a live streaming product, and the IM long-connection service plays a pivotal role in it.

This article gives an overview of the message model and architecture of our show-livestream messaging system, and walks through how the model evolved as we dealt with different production issues over the past year. We have organized those lessons into this document to share with everyone.

In most mainstream live streaming services today, stream pushing and pulling is the basic technique for delivering video, while messaging is the key technique for interaction between viewers and hosts. Through the live IM module we build core show-livestream features such as public-screen chat, colored bullet comments, network-wide gift broadcasts, private messages, and PK battles. IM messages are the information bridge between users, and between users and hosts; keeping that bridge stable and reliable in high-concurrency scenarios is an important topic in the evolution of a live streaming system.

2. Live message service

The live streaming business involves several core concepts in its message model. We introduce them briefly first, so that everyone has an overall picture of the message model for live streaming.

2.1 Hosts and users

To the IM system, both the host and the viewers are ordinary users, each with a unique user ID, which is also the key identifier the IM system uses to deliver point-to-point messages.

2.2 Room number

A host corresponds to one room number (RoomId). Before going live, the host's identity is verified and a unique room number is bound to them. The room number is the key identifier the IM system uses to deliver messages within a live room.

2.3 Classification of message types

Based on the characteristics of the live streaming business, IM messages can be classified in many ways: by receiver, by the business type of the message within the room, by message priority, by storage method, and so on.

Generally, we divide the following types of messages according to the receiver dimension:

  • Point-to-point messaging (unicast messaging)
  • Live room message (group broadcast message)
  • Broadcast message

According to specific business scenarios, there are the following types of messages:

  • Gift message
  • Public screen message
  • PK message
  • Business notification messages

It is essential that messages are distributed accurately and in real time to the corresponding groups or individual user terminals. A well-designed IM message model can also empower the business with new capabilities, such as:

  • Count the real-time online audience of each live room
  • Capture users' enter/leave events for the live room
  • Record in real time when each user entered the live room

2.4 Message priority

Message priority matters a great deal in live streaming. Unlike chat products such as WeChat and QQ, messages in a live room are prioritized.

In a chat product like WeChat, whether in a private or group chat, everyone's messages have essentially the same priority; no one's messages rank higher or lower, and all of them must be delivered accurately and in real time to every client. In live streaming, however, business scenarios differ, so the distribution priority of messages differs too.

For example, suppose a live room client can only render 15-20 messages per second. If a hot room produces more than 20 messages per second and every message is distributed in real time without priority control, the result is that public-screen rendering in the room stutters and gift animations play too fast, greatly degrading the viewing experience. We therefore need to assign different priorities to messages of different business types.

For example, gift messages rank higher than public-screen messages; among messages of the same business type, a big-gift message ranks higher than a small-gift one; and public-screen messages from high-level users rank higher than those from low-level or anonymous users. When distributing business messages, public-screen messages need to be selectively and accurately delivered according to these actual priorities.

3. Key technical points of messaging

3.1 Message Architecture Model

3.2 Short polling vs. long connection

3.2.1 Short polling

3.2.1.1 Business model of short polling

First, let's briefly describe the short-polling flow and its basic design:

  • The client polls a server interface every 2 s with parameters roomId and timestamp; timestamp is 0 or null on the first request.
  • Using roomId and timestamp, the server queries the message events produced in that room after the given timestamp and returns a limited batch (for example 10-15 messages; far more messages may exist after that timestamp, but the client's rendering capacity is limited and showing too many messages hurts the experience, so the count is capped). It also returns the timestamp of the last message in the batch, which the client uses as the timestamp of its next request.
  • Repeating this, each client refreshes the latest messages of its live room every 2 s.
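The cursor-style loop above can be sketched in Python as follows. The in-memory store, the message contents, and the 2-message cap are illustrative stand-ins for the real service and its 10-15 message cap:

```python
# Hypothetical in-memory stand-in for the server-side message store,
# keyed by roomId; each message carries the timestamp it was produced at.
STORE = {"888888": [(1, "msg-a"), (2, "msg-b"), (3, "msg-c"), (4, "msg-d")]}

MAX_RETURN = 2  # per-response cap; the real service returns ~10-15

def poll(room_id, since):
    """Return up to MAX_RETURN messages newer than `since`, plus the
    timestamp of the last returned message as the next poll's cursor."""
    newer = [m for m in STORE.get(room_id, []) if m[0] > since]
    batch = newer[:MAX_RETURN]
    next_ts = batch[-1][0] if batch else since
    return [text for _, text in batch], next_ts

# Client loop: the first call passes 0; later calls pass the returned
# cursor (in the real client this runs on a ~2 s timer).
cursor, seen = 0, []
for _ in range(3):
    msgs, cursor = poll("888888", cursor)
    seen.extend(msgs)
```

Note how an empty poll simply returns the same cursor, which is exactly the "empty polling" case discussed later.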

That is the overall idea; the specific timing can be refined further, and details are given below.

3.2.1.2 Short-polling storage model

Short-polling message storage differs somewhat from ordinary long-connection message storage in that there is no message-fanout problem. Our storage needs to meet the following business goals:

  • The time complexity of message insertion should be relatively low;
  • The complexity of message query is relatively low;
  • The structure of the message storage should be relatively small and should not occupy too much memory space or disk space;
  • Historical messages can be persisted to disk according to business needs;

Weighing these four requirements, the team decided after discussion to use Redis's SortedSet data structure for storage. The concrete idea: according to the live room's product business, business messages are divided into four types: gift, public screen (chat), PK, and notification.

The messages of one live room are stored in four Redis SortedSets, with keys "live::roomId::gift", "live::roomId::chat", "live::roomId::notify", and "live::roomId::pk". The score is the timestamp at which the message was actually produced, and the value is the serialized JSON string, as shown below:

When the client polls, the server's query logic is as follows:
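Since the original query diagram is not reproduced here, the following sketch shows the write and query paths with a pure-Python stand-in for the SortedSet. The key layout and the 15-message cap follow the text; the payloads, timestamps, and function names are illustrative:

```python
import bisect
import json

class MiniZSet:
    """Pure-Python stand-in for a Redis SortedSet (score-sorted members)."""
    def __init__(self):
        self.items = []  # sorted list of (score, member)

    def zadd(self, score, member):
        bisect.insort(self.items, (score, member))  # O(log n), like ZADD

    def zrangebyscore(self, min_excl, max_incl):
        # like ZRANGEBYSCORE key (min max: members with min < score <= max
        return [(s, m) for s, m in self.items if min_excl < s <= max_incl]

BIZ_TYPES = ("gift", "chat", "notify", "pk")
store = {}  # key "live::<roomId>::<biz>" -> MiniZSet

def save(room_id, biz, ts, payload):
    key = f"live::{room_id}::{biz}"
    store.setdefault(key, MiniZSet()).zadd(ts, json.dumps(payload))

def query(room_id, since, now, limit=15):
    """Per-poll query: merge the four per-type keys, sort by timestamp,
    cap the batch, and hand back the cursor for the next poll."""
    merged = []
    for biz in BIZ_TYPES:
        zset = store.get(f"live::{room_id}::{biz}")
        if zset:
            merged.extend(zset.zrangebyscore(since, now))
    merged.sort(key=lambda x: x[0])
    batch = merged[:limit]
    next_ts = batch[-1][0] if batch else since
    return [json.loads(m) for _, m in batch], next_ts

save("888888", "chat", 101, {"text": "hi"})
save("888888", "gift", 102, {"gift": "rose"})
msgs, cursor = query("888888", since=0, now=200)
```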

Many readers will ask: why not use Redis's List structure instead? The reason is lookup complexity: a SortedSet lets the server fetch exactly the messages in a timestamp range with ZRANGEBYSCORE, whereas a List is ordered only by insertion position, so finding all messages produced after a given timestamp would require scanning the list.

Finally, comparing time complexity for live-message storage: inserting into a SortedSet (ZADD) is O(log N) and a range query (ZRANGEBYSCORE) is O(log N + M) for M returned members, whereas appending to a List (RPUSH) is O(1) but a query by timestamp degrades to O(N).

These are some of the design considerations behind using Redis SortedSets for message storage. Below we also cover the points that need attention when implementing short polling.

3.2.1.3 Time control of short polling

The time control of short polling is extremely important. We need to find a good balance between QoE and server pressure.

If the polling interval is too long, the user experience degrades noticeably; the live room feels choppy, with messages arriving in fits and starts. If the polling frequency is too high, it puts excessive pressure on the server and produces many "empty polls": invalid polls that return nothing because no new message was produced in the room during the interval since the last valid poll.

vivo live streaming currently handles over one billion polls per day. During the evening viewing peak, CPU load on the servers and on Redis rises, and the thread pool of the Dubbo service provider stays at its high-water mark. We then scale the servers horizontally and expand the Redis Cluster nodes according to the real-time load of the machines and Redis, and even route some extremely popular live rooms to a dedicated Redis Cluster for physical isolation ("VIP" treatment), so that the messages of different rooms do not affect one another.

The polling interval can also be configured per audience size. For example, rooms with fewer than 100 viewers can poll at a relatively high frequency, around 1.5 s; rooms with 300 to 1,000 viewers at around 2 s; and rooms with 10,000 viewers at around 2.5 s. These settings are delivered in real time through the configuration center, so the client can update its polling interval on the fly. The values are tuned by combining the actual in-room user experience with server load to find a near-optimal polling interval.
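A minimal sketch of such tiered configuration, using the interval tiers quoted above. The thresholds would in practice be delivered by the configuration center rather than hard-coded:

```python
# Illustrative mapping from room audience size to polling interval,
# following the tiers mentioned in the text; real values come from the
# configuration center and can be updated at runtime.
INTERVAL_TIERS = [            # (audience upper bound, interval seconds)
    (100, 1.5),
    (1000, 2.0),
    (float("inf"), 2.5),
]

def polling_interval(audience: int) -> float:
    """Pick the polling interval for a room with `audience` viewers."""
    for max_audience, interval in INTERVAL_TIERS:
        if audience < max_audience:
            return interval
    return INTERVAL_TIERS[-1][1]
```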

3.2.1.4 Points to note about short polling

1) The server must validate the timestamp passed by the client. This is critical: imagine a viewer sends the app to the background, which suspends the client's polling loop; when they resume watching, the timestamp the client sends will be very old or even expired. Such a timestamp causes slow Redis queries on the server; if slow queries pile up, the server's connections to Redis are not released quickly, overall server performance degrades, and a flood of polling-interface timeouts appears within moments, sharply reducing service quality and QoE.
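One possible server-side guard, sketched below: clamp the client's cursor to a bounded retention window before querying Redis. The window size and names are assumptions for illustration, not vivo's actual values:

```python
# Sketch of server-side timestamp sanitation: if the cursor a resumed
# client sends is too old, clamp it so the Redis range query stays
# bounded and fast. The retention window is illustrative.
RETENTION_MS = 60_000  # keep only the most recent minute queryable

def sanitize_ts(client_ts, now_ms):
    """Return a safe query cursor given the client-supplied timestamp."""
    floor = now_ms - RETENTION_MS
    if client_ts is None or client_ts <= 0 or client_ts < floor:
        return floor           # too old: app resumed from background
    if client_ts > now_ms:
        return now_ms          # clock skew: never query the future
    return client_ts
```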

2) The client must deduplicate messages. In extreme cases the client may receive duplicates, for the following reason: at some moment the client sends a request with roomId=888888&timestamp=t1; because the network is unstable or the server is in GC, processing takes more than 2 s, but since the polling timer has fired, the client sends another request with roomId=888888&timestamp=t1, and the server returns the same data. The client then renders the same messages twice, which also hurts the user experience, so every client must check for duplicate messages.
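A minimal client-side dedup guard, assuming each message carries some unique id. The bounded-set design here is illustrative, not vivo's actual implementation:

```python
from collections import deque

class MessageDeduper:
    """Client-side guard against replayed polling responses: remember
    the ids of recently rendered messages and drop repeats. Bounded so
    memory stays flat over a long viewing session."""
    def __init__(self, capacity=512):
        self.seen = set()
        self.order = deque()
        self.capacity = capacity

    def accept(self, msg_id) -> bool:
        """Return True if this message is new and should be rendered."""
        if msg_id in self.seen:
            return False       # duplicate response for the same cursor
        self.seen.add(msg_id)
        self.order.append(msg_id)
        if len(self.order) > self.capacity:
            self.seen.discard(self.order.popleft())
        return True
```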

3) Massive message volumes cannot all be returned and rendered in real time. In a very hot room producing thousands or tens of thousands of messages per second, the storage-and-query scheme above breaks down: since each response returns only 10-20 messages, returning all of one hot second's data takes a long time, so the newest messages cannot be returned quickly and preferentially. Therefore, the messages returned by a poll can also be selectively discarded according to message priority.

The benefit of the client polling the business server for live messages is obvious: distribution is real-time and accurate, and it is hard for messages to go missing due to network jitter. But the drawback is just as obvious: server load during business peaks is very high, and if all live room messages were distributed by polling, it would be difficult in the long run for the server to scale linearly through horizontal expansion.

3.2.2 Long connection

3.2.2.1 Architecture model of persistent connection

In terms of process, the overall long-connection flow for live streaming is as follows:

The mobile client first requests the long-connection server over HTTP to obtain the IP address for the TCP long connection; the long-connection server returns a list of the best candidate IPs according to its routing and load strategy.

Using the returned IP list, the mobile client initiates the long-connection request, and the long-connection server accepts it and establishes the connection.

The mobile client then sends authentication information; after communication authentication and identity confirmation, the long connection is established. The long-connection server subsequently handles connection management, heartbeat monitoring, disconnection and reconnection, and similar tasks.

The basic architecture of the long-connection server cluster is shown below. The service is partitioned by region, and terminals in different regions connect to nearby nodes on demand.

3.2.2.2 Long connection establishment and management

To deliver messages to users instantly, efficiently, and securely, the live client and the IM system establish an encrypted full-duplex data channel for sending and receiving messages. When large numbers of users are online, maintaining these connections and sessions consumes a great deal of memory and CPU.

The IM access layer keeps its functions as simple as possible and pushes business logic down into subsequent logic services. To prevent releases and restarts from forcing large numbers of external devices to reconnect and hurting the user experience, the access layer provides a hot-update release scheme: rarely changed basic logic such as connection maintenance and account management lives in the main program, while business logic is embedded as .so plugins. When business logic changes, only the plugin needs to be reloaded, which guarantees that the long connections to devices are unaffected.

3.2.2.3 Long connection keep-alive

After a long connection is established, if the intermediate network breaks, neither the server nor the client can perceive it, producing a false-online state. The key issue in keeping a long connection alive is therefore to notice problems on the intermediate link quickly and establish a new usable connection by reconnecting, so the long connection stays highly available. IM enables a keepalive detection mechanism on the server side and a smart heartbeat on the client side.

  • Keepalive detection catches unexpected situations such as client crashes, intermediate network interruptions, and intermediate devices dropping connection table entries on timeout, ensuring the server can release half-open TCP connections when accidents occur.
  • The client's smart heartbeat not only reports the client's liveness to the server and periodically refreshes the NAT mapping between internal and external addresses, but also automatically reconnects the long connection when the network changes.
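As one concrete example of server-side keepalive, the kernel's TCP keepalive can be enabled per socket. The sketch below uses Python's standard socket options; the timing values are illustrative, and TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux-specific, hence the hasattr guards:

```python
import socket

def enable_keepalive(sock, idle_s=30, interval_s=10, probes=3):
    """Turn on kernel TCP keepalive so half-open connections are
    detected and released. After idle_s seconds of silence, the kernel
    sends up to `probes` probes interval_s seconds apart before
    declaring the peer dead."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):       # Linux-specific options
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

The client-side smart heartbeat complements this with application-level pings, since kernel keepalive alone cannot refresh NAT mappings or trigger fast reconnects.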

3.2.3 IM message distribution in live room

Overall flow of IM long-connection message distribution:

Integrating the three modules (client, IM long-connection server, and live business server), the overall message distribution logic follows these basic principles:

  • For unicast, group broadcast, and broadcast alike, the live business server calls the IM long-connection server's interface to distribute the messages that need distributing to each live room.
  • The business server handles the events produced in a live room according to their business type, e.g. deducting virtual currency for gifts, or content-safety checks on public-screen text.
  • The client follows the signaling control of the live business server: whether a message is distributed over the long-connection channel or via HTTP short polling is decided by the live business server. The client hides the details of how messages are obtained at the bottom layer; its upper layer receives a unified message data format and processes and renders each message by business type.

3.2.3.1 Member management and message distribution in the live broadcast room

Live room members are the most important basic metadata of a live room. The number of users in a single room is effectively unbounded; in practice there are a few very large rooms (more than 300,000 concurrently online), hundreds of medium-sized rooms, and tens of thousands of small rooms. Given such a distribution, how to manage room members is one of the core capabilities of a live room system architecture. There are two common approaches:

1. Fixed sharding: a live room has a fixed number of shards, users map to specific shards, and users are distributed among the shards roughly at random.

Fixed sharding is simple to implement, but for a room with few users each shard may carry very few users, while for a room with many users each shard may carry a great many; fixed sharding is inherently poor at scaling.

2. Dynamic sharding: a per-shard user count is specified; when the number of users exceeds the threshold, a new shard is added, so the number of shards grows with the number of users.

Dynamic sharding automatically creates shards based on room size, opening a new shard when the current one fills, so each shard's user count approaches the threshold. However, the user count of existing shards changes as users enter and leave the room, so maintenance complexity is relatively high.
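The two strategies can be sketched as follows; the shard counts, thresholds, and class names are illustrative:

```python
def fixed_shard(user_id, shard_count=8):
    """Fixed sharding: a user always maps to the same shard of a room."""
    return hash(user_id) % shard_count

class DynamicShards:
    """Dynamic sharding: open a new shard once the current shard
    reaches the per-shard user threshold."""
    def __init__(self, threshold=1000):
        self.threshold = threshold
        self.shards = [set()]

    def add(self, user_id):
        if len(self.shards[-1]) >= self.threshold:
            self.shards.append(set())   # current shard full: open a new one
        self.shards[-1].add(user_id)

    def remove(self, user_id):
        # users leave at any time, so earlier shards drift below the
        # threshold; this churn is the maintenance cost described above
        for shard in self.shards:
            shard.discard(user_id)
```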

3.2.3.2 Message Distribution in Live Room

There are a variety of messages in the live broadcast room, such as entry and exit messages, text messages, gift messages, and public screen messages. The importance of the messages is different, and the corresponding priority can be set for each message.

Messages of different priorities are placed in different message queues; high-priority messages are sent to clients first, and when the backlog exceeds its limit, the oldest low-priority messages are discarded. Moreover, live room messages are real-time: there is little value for users in fetching historical or offline messages, so messages are stored and organized in a read-diffusion fashion. When messages are sent in a live room, the delivery service, guided by the room-membership notification, distributes them to each user in the corresponding shards. To deliver live messages to users in real time and efficiently, when a user has several undelivered messages, the delivery service sends them in a batch.
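A sketch of the per-priority queues with bounded backlog and batch draining described above; the level count, cap, and batch size are illustrative:

```python
from collections import deque

class PriorityDispatchBuffer:
    """Per-user send buffer: one queue per priority level (0 = highest).
    When the total backlog exceeds the cap, the oldest message of the
    lowest non-empty priority is dropped. Draining pops high-priority
    messages first and batches them for a single delivery."""
    def __init__(self, levels=3, cap=100):
        self.queues = [deque() for _ in range(levels)]
        self.cap = cap

    def push(self, priority, msg):
        self.queues[priority].append(msg)
        while sum(len(q) for q in self.queues) > self.cap:
            for q in reversed(self.queues):   # lowest priority first
                if q:
                    q.popleft()               # oldest message of that level
                    break

    def drain(self, batch_size=20):
        batch = []
        for q in self.queues:                 # highest priority first
            while q and len(batch) < batch_size:
                batch.append(q.popleft())
        return batch
```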

3.2.3.3 Message compression for long connections

When distributing live messages over TCP long connections, the size of the message body also needs attention. If many messages are distributed at one moment, or the same group-broadcast message goes to many users, the egress bandwidth of the IM connection layer's data center becomes the bottleneck of message distribution. So controlling and compressing the size of each message is another problem we need to think about. We currently optimize the message body structure in two ways:

  • Use the protobuf data-exchange format
  • Combine messages of the same type and send them together

In our online tests, using the protobuf data-exchange format saved an average of 43% of the bytes per message, which greatly helps save the data center's egress bandwidth.

3.2.3.4 Block Message

The so-called block message is a technique we borrowed from other live streaming platforms: multiple messages are combined and sent together. Instead of calling the IM long-connection server cluster to distribute each message the moment it is produced, the live business server, taking the live room as the unit, distributes the messages produced by the business system at a uniform interval, every 1 s or 2 s.

Suppose we distribute 10-20 messages per second. If the business server accumulates more than that in a second, the excess is discarded according to message priority; if those excess messages are all high priority (for example all gift messages), they are carried over into the next message block instead. The benefits are as follows:

  • Combining messages reduces redundant message headers: multiple messages sent together can share a header in a custom TCP transport protocol, further reducing byte size;
  • It prevents message storms: the live business server can conveniently control the rate of distribution and will not flood the client with more messages than it can handle;
  • It gives a friendly user experience: messages flow through the room at a steady rate and render with an even rhythm, making the whole live broadcast feel smooth.
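The block-message cycle can be sketched as follows, assuming per-room buffering, a ~20-message block cap, and carry-over of high-priority overflow as described above. The class and field names are illustrative:

```python
class BlockDispatcher:
    """Per-room block-message buffer: business messages accumulate here,
    and flush() is called once per tick (1-2 s in the text) to produce
    the block handed to the IM long-connection cluster."""
    BLOCK_LIMIT = 20
    HIGH_PRIORITY = 0

    def __init__(self):
        self.buffer = []   # (priority, msg); lower number = higher priority

    def submit(self, priority, msg):
        self.buffer.append((priority, msg))

    def flush(self):
        """Return the next block; drop low-priority overflow, but carry
        high-priority overflow (e.g. gifts) into the next block."""
        self.buffer.sort(key=lambda x: x[0])   # stable: keeps arrival order
        block = self.buffer[:self.BLOCK_LIMIT]
        overflow = self.buffer[self.BLOCK_LIMIT:]
        self.buffer = [(p, m) for p, m in overflow if p == self.HIGH_PRIORITY]
        return [m for _, m in block]
```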

3.3 Message discard

Whether over HTTP short polling or long connections, high-traffic rooms will see messages dropped. For example, in a game live stream, at an exciting moment the number of public-screen comments spikes instantly, and so do the low-value gift messages sent to cheer on the players; the server may then be distributing thousands or tens of thousands of messages per second through IM long connections or HTTP short polling. Such a surge causes the following problems on the client:

The messages the client receives over the long connection surge, downstream bandwidth pressure spikes, and other services may be affected (for example, a gift's SVGA animation cannot be downloaded and played in time);

The client cannot process and render so many gift and public-screen messages quickly enough; CPU pressure spikes and audio/video processing is affected too;

Because of the message backlog, long-expired messages may end up displayed, and user experience (QoE) metrics decline.

For these reasons, dropping messages is necessary. A simple example of priorities: gift messages must outrank public-screen messages, PK progress-bar messages must outrank network-wide broadcast messages, and high-value gift messages outrank low-value ones.

Based on these business rules, we apply the following controls in the actual code:

Combining the specific business characteristics, each business type is assigned a message priority level. When message distribution triggers flow control, low-priority messages are selectively discarded according to that priority.

The message structure gains two new fields, creation time and send time. When actually sending on the long-connection channel, we check whether the gap between the current time and the message's creation time is too large; if it is, the message is discarded directly.

Gain messages (correction messages): in business development, design messages as gain messages wherever possible. A gain message is one whose later arrival subsumes the messages that arrived before it. For example, at 9:10 the PK score between host A and host B is 20 to 10; the PK message distributed at 9:11 should carry the absolute value 22 to 10, not the increment 2:0 that expects the client to accumulate (20+2 : 10+0). Since messages can be lost to network jitter or prior discarding, distributing gain (correction) messages lets the business state recover to normal.
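The PK example above, as a sketch: the client applies each gain message as an absolute overwrite, so a lost intermediate message does not corrupt the score. The names and payload shape are illustrative:

```python
class PkScoreboard:
    """Client-side PK score state driven by gain (correction) messages."""
    def __init__(self):
        self.a = 0
        self.b = 0

    def apply(self, msg):
        # absolute ("gain") message: overwrite, never accumulate
        self.a, self.b = msg["a"], msg["b"]

board = PkScoreboard()
board.apply({"a": 20, "b": 10})   # the 9:10 message
# An incremental form (+2, +0) would be lost forever if this message
# were dropped; the absolute form self-corrects on the next arrival:
board.apply({"a": 22, "b": 10})   # the 9:11 message
```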

4. Final thoughts

In any live streaming system, as the business grows and rooms keep getting hotter, the messaging system's problems and challenges follow: whether it is message storms on long connections or massive HTTP short-polling request volumes, both drive server pressure up sharply, and that is what we must keep solving and optimizing. We must target the business characteristics of each period, keep upgrading live messaging, and build an evolvable message module, so that message-distribution capacity can sustain the continued growth of the business.

vivo's live messaging module is also evolving step by step, driven mainly by business growth. As business forms diversify and the viewing audience grows, the system's features increase and performance bottlenecks surface one after another. To solve real performance problems, we analyze code and interface bottlenecks one at a time and then produce the corresponding optimization or decoupling scheme; the message module is no exception. We hope this article can bring you some design inspiration for your own live messaging modules.

Author: vivo Internet Technology-LinDu, Li Guolin
