Technical Practice | Scenario-Oriented Optimization of the Audio and Video Call Experience


Audio and video calls are among the most commonly used communication methods in modern life: one-on-one audio and video chat in social networking, remote consultation in medical care, online property viewing in real estate, and the one-to-one and group calls that can happen anytime, anywhere in remote work. This article is reposted from the public account [Rongyun Global Internet Communication Cloud].

In everyday use, however, we have all run into places where the product logic gets "stuck": a call cannot switch freely between audio and video, a 1V1 call cannot be upgraded to a group call, and a group video call cannot be rejoined after leaving it.

On June 16th, the Rongyun RTC Advanced Practical Master Course focused on audio and video calls. Covering how audio and video calls are implemented, the pain points of the multi-scenario experience, and best practices with the new Rongyun CallLib, it broke down the experience challenges that calls face in different application scenarios and shared optimization solutions.

Audio and video call implementation

The audio and video calls discussed here are application scenarios, such as WeChat calls, that include a call flow. There is one caller and one or more callees: the caller initiates the call, and each callee can choose to answer or hang up.

There are many use cases for audio and video calls, especially when we are going through a major digital transformation.

For example, in remote consultation, doctors talk with patients over audio and video calls to support diagnosis. In VR property viewing, real estate agents talk with tenants over a call and combine it with VR for remote viewing. In online matchmaking, a matchmaker can use a group call to let the prospective partners talk to each other, which is a group audio and video call scenario.

(Audio and video call usage scenarios)

Audio and video calls appear in every aspect of our online life and are a necessary capability for many kinds of applications. So how is an audio and video call implemented?

The basic knowledge required falls roughly into three parts: audio and video, network transmission, and the server. Together they form a very complex system, so only a general description is given here.

(RTC Basics)

Audio and video

Audio and video capture: different platforms capture audio and video in different ways. Developers need to understand raw data, that is, the data format obtained directly from capture.

Audio and video data processing: audio noise reduction, acoustic echo cancellation, image cropping, and so on.

Audio and video codecs: At least one audio codec and one video codec must be mastered.

Each encoding format has corresponding decoders, and the more common formats can also be decoded in hardware, so software or hardware decoding can be chosen to suit the solution. Software decoding is highly compatible but consumes CPU, while hardware decoding performs well but may run into compatibility problems.

Audio and video playback and rendering must also be mastered. Playback and rendering usually operate on raw data. Audio playback is in practice PCM playback. For video, a native client typically uses OpenGL, and a browser uses WebGL or similar.
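To make the raw audio format concrete: PCM is simply a sequence of integer samples. The sketch below synthesizes one 20 ms frame of 16-bit mono PCM, the kind of buffer that a capture or playback interface passes around. The sample rate, the frame length, and the name `sine_pcm16` are assumptions chosen for illustration and do not come from any SDK.

```python
import math
import struct

SAMPLE_RATE = 16000  # samples per second (assumed for illustration)


def sine_pcm16(freq_hz: float, duration_ms: int, amplitude: float = 0.5) -> bytes:
    """Return mono, 16-bit, little-endian PCM samples of a sine tone."""
    sample_count = SAMPLE_RATE * duration_ms // 1000
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(sample_count)
    )
    # "<h" packs each sample as a little-endian signed 16-bit integer.
    return b"".join(struct.pack("<h", s) for s in samples)


frame = sine_pcm16(440.0, 20)  # one 20 ms frame: 320 samples, 640 bytes
```

A codec would take frames like this as input, and a playback device would consume them after decoding.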

Network transmission

The client can choose TCP or UDP. Audio and video are usually transmitted over UDP, because audio and video data needs timeliness more than completeness: it can still be played normally even if some of the data is missing.
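The tolerance for missing data can be sketched as follows. This is a minimal illustration, assuming each media packet carries a sequence number and that a lost frame is simply replaced with silence. Real systems use jitter buffers and smarter concealment, and the names `playout` and `SILENCE` are hypothetical.

```python
from typing import Dict, List

FRAME_BYTES = 4                      # tiny frames, for illustration only
SILENCE = b"\x00" * FRAME_BYTES


def playout(received: Dict[int, bytes], first_seq: int, last_seq: int) -> List[bytes]:
    """Order frames by sequence number; play silence where a frame was lost."""
    return [received.get(seq, SILENCE) for seq in range(first_seq, last_seq + 1)]


# Frames 0..4 were sent; frame 2 never arrived and is not retransmitted.
arrived = {0: b"aaaa", 1: b"bbbb", 3: b"dddd", 4: b"eeee"}
timeline = playout(arrived, 0, 4)
```

Playback continues on schedule with a short gap instead of stalling while a retransmission is awaited, which is the trade-off that favors UDP for media.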

TCP and QUIC are both reliable transport protocols, and they are used more often when completeness is required, for example for business signaling data.

QUIC is a reliable protocol built on top of UDP. UDP itself is unreliable, so to guarantee delivery over it you need to implement a loss-detection and retransmission policy yourself. QUIC is a protocol that already includes such a retransmission policy and can therefore provide the same delivery guarantee as TCP.
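The idea of layering retransmission on an unreliable transport can be sketched with the simplest possible policy, stop-and-wait: send a packet, wait for an acknowledgement, and resend if none arrives. This is not how QUIC itself works internally (it is far more sophisticated); the channel here is a fake that loses chosen packets, and every name in the block is hypothetical.

```python
from typing import Callable, Iterable, List, Set, Tuple

Channel = Callable[[int, bytes], bool]  # returns True when the packet is acknowledged


def make_lossy_channel(lose_first_attempt: Set[int]) -> Channel:
    """Build a fake channel that loses the first attempt of the given packets."""
    already_lost: Set[int] = set()

    def channel(seq: int, payload: bytes) -> bool:
        if seq in lose_first_attempt and seq not in already_lost:
            already_lost.add(seq)
            return False  # no acknowledgement: the sender must retransmit
        return True

    return channel


def send_reliably(packets: Iterable[bytes], channel: Channel,
                  max_attempts: int = 5) -> Tuple[List[bytes], int]:
    """Stop-and-wait: resend each packet until it is acknowledged."""
    delivered: List[bytes] = []
    attempts = 0
    for seq, payload in enumerate(packets):
        for _ in range(max_attempts):
            attempts += 1
            if channel(seq, payload):
                delivered.append(payload)
                break
        else:
            raise TimeoutError(f"packet {seq} was never acknowledged")
    return delivered, attempts


messages = [b"invite", b"ringing", b"accept"]
delivered, attempts = send_reliably(messages, make_lossy_channel({1}))
```

All three messages arrive in order, at the cost of one extra transmission for the packet that was lost once.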

Server

Servers are usually divided into a signaling server and an audio and video server. As the names imply, the signaling server transmits business-related data, and the audio and video server transmits audio and video data. Different technical solutions implement these servers with different technologies.

With the basic knowledge above, we can build the core logic of an audio and video call. Before considering the business processing that is specific to the call scenario, two problems must be solved first.

The first is instant delivery of business data. A call involves two clients, the calling end and the called end. How does the called end receive the call-initiation signaling sent by the calling end? This can be implemented with push plus a long-lived connection, or with an IM SDK from a third party.
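As an illustration of what such initiation signaling might carry, here is a minimal sketch that encodes and decodes a call invite as JSON. The message layout, the field names, and the `CallInvite` type are assumptions made for this example. They are not the wire format of Rongyun or of any other SDK.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class CallInvite:
    call_id: str
    caller_id: str
    callee_ids: List[str]
    media_type: str  # "audio" or "video"


def encode_invite(invite: CallInvite) -> bytes:
    """Serialize the invite as a UTF-8 JSON signaling message."""
    return json.dumps({"type": "invite", **asdict(invite)}).encode("utf-8")


def decode_invite(raw: bytes) -> CallInvite:
    """Parse a signaling message on the called end, rejecting other types."""
    data = json.loads(raw.decode("utf-8"))
    if data.pop("type", None) != "invite":
        raise ValueError("not an invite message")
    return CallInvite(**data)


invite = CallInvite("call-1", "alice", ["bob"], "audio")
wire = encode_invite(invite)
```

The bytes in `wire` are what the push channel, long-lived connection, or IM SDK would deliver to the called end, which then calls `decode_invite` and shows the incoming-call screen.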

The second, and more critical, is basic audio and video capability together with real-time transmission of audio and video data.

Developers building this themselves can choose WebRTC, a complete open source solution from Google that provides basic audio and video capabilities and transmission capabilities, or they can build a complete RTC system from scratch. Either way, self-development involves a great deal of complex underlying knowledge and requires considerable R&D capability and staffing.

Rongyun offers developers a more convenient alternative: Rongyun CallLib.

For audio and video call scenarios, Rongyun integrates and encapsulates its IM and RTC capabilities into an SDK that includes the complete call flow. Developers only need to work with the small number of interfaces that CallLib exposes to implement their call requirements. It has the following advantages.

Completeness: a complete call flow with support for one-to-one and multi-party calls, so developers do not need to care about the underlying implementation.

Ease of use: works out of the box, is quick to integrate, and can be customized flexibly.

Stability: 100% reliable delivery of IM messages, resistance to 80% packet loss for RTC audio, and resistance to 60% packet loss for video.

Scenario diversity: strong extensibility that can meet audio and video call needs in many fields.

With Rongyun CallLib, developers can quickly build an app with calling functionality. We have also upgraded and optimized the product to address the "stuck" scenarios mentioned at the beginning.

Multi-scenario call experience optimization requirements

Call upgrade

Audio and video upgrade and downgrade: users can switch directly to a video call during an audio call.

1V1 upgrade to group chat: go directly from a 1V1 call to a multi-person group call.

Join an ongoing call at any time: if a user does not answer a group call right away, they can still choose to join as long as the call has not ended.

Stream control upgrade

A free-form API for publishing audio and video streams; previewing multiple calls; choosing which call to answer, and call waiting.

This mainly optimizes the stream control API so that developers can use it conveniently and build higher-level features to suit their own needs.

Data consistency upgrade

This covers consistency of call timing, consistency of participant state synchronization, and consistency of business logic in extreme scenarios.

The data consistency upgrade mainly aims to give every user in a call a more consistent experience, especially users who join from different kinds of devices, and to make later business expansion easier.

So how are these optimizations and upgrades achieved?

Rongyun CallLib is built on IMLib and RTCLib and follows a client-heavy design.

All state and business logic has to live on the client, which makes the client more complex to implement. The business logic must be implemented again on every platform, which can also lead to inconsistent state in extreme cases.

For example, suppose the caller and the callee tap hang up at the same time. When the caller hangs up, the call is cancelled; when the callee hangs up, the call is rejected; the two outcomes appear differently in the call record. Because both operations are sent and processed by the clients, it is impossible to tell which one happened first, so the call records become inaccurate.

This problem can in fact be handled, but doing so is very complicated, because a client-heavy design is hard to extend.

Rongyun New CallLib Practice

The new Rongyun CallLib solves the problems above elegantly.

Upgraded business processing: a server-centric design with a lightweight client.

The state that the original CallLib maintained is now maintained and stored on the server, and CallLib has been turned into a state machine driven by the server.

This greatly reduces the complexity of the client implementation and avoids inconsistent logic across platforms.

Because state changes are issued by the server, state consistency is guaranteed, and extreme scenarios and online/offline transitions are easier to handle. In the simultaneous hang-up example above, one of the two operations must reach the server first, and the server then issues the resulting state, so events are ordered. Call records are also generated by the server, so the two ends can no longer disagree.
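The simultaneous hang-up case can be sketched as a small server-side state machine. This is a minimal illustration under stated assumptions, not Rongyun's implementation: the `CallSession` class and the state names are hypothetical, and "arrival order at the server" is modeled simply as the order of method calls.

```python
from enum import Enum


class CallState(Enum):
    RINGING = "ringing"
    CANCELLED = "cancelled"  # the caller hung up before the call was answered
    REJECTED = "rejected"    # the callee declined the call


class CallSession:
    """Server-side state for one unanswered 1V1 call (hypothetical model)."""

    def __init__(self, caller: str, callee: str) -> None:
        self.caller = caller
        self.callee = callee
        self.state = CallState.RINGING

    def hang_up(self, user: str) -> CallState:
        """Apply a hang-up request and return the state the server broadcasts."""
        if user not in (self.caller, self.callee):
            raise ValueError(f"{user} is not part of this call")
        if self.state is CallState.RINGING:
            # Only the first request to reach the server changes the state.
            self.state = (CallState.CANCELLED if user == self.caller
                          else CallState.REJECTED)
        return self.state


session = CallSession(caller="A", callee="B")
first = session.hang_up("B")   # B's request happens to reach the server first
second = session.hang_up("A")  # A's request arrives later and changes nothing
```

Both clients receive the same final state, so the call record written from it is identical on both ends.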

Let's look at specific scenarios to see how the new design upgrades the experience.

1V1 audio and video call flow

This is the most basic calling function. On the right is the complete sequence diagram of this basic capability, which involves four roles: Client A and Client B are the calling and called ends, the Call Server handles call-related business logic, and the RTC Server handles audio and video.

As the diagram shows, implementing even this basic capability is quite complicated.

Upgrading and downgrading of audio and video

Many people have probably had this experience: during a voice call you need to turn on the camera for some reason, and you have to hang up the audio call and start a video call instead.

Rongyun has optimized this scenario. When a user needs to temporarily upgrade an audio call to a video call, they can send an upgrade request within the call. The other end can accept or reject it: if it is accepted, the call is upgraded to video, and if it is rejected, the audio call continues. The initiator can also cancel the upgrade request.

Extreme cases can occur here too, for example when the initiator taps cancel and the other end taps accept at the same time. The Call Server then acts as the arbiter. For example, if cancellation is treated as authoritative, both ends stay in the audio call even though the other end tapped accept. In the previous client-heavy design, this decision would have been very complicated.
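The arbitration rule described above, where cancellation takes precedence, could look like the following sketch. The function name `resolve_upgrade` and the operation strings are assumptions for illustration, and the rule itself is only the example policy mentioned in the text.

```python
from typing import Iterable


def resolve_upgrade(operations: Iterable[str]) -> str:
    """Return the media type after an audio-to-video upgrade request settles.

    `operations` holds what the server received for one pending request:
    "cancel" from the initiator, "accept" or "reject" from the other end.
    """
    received = set(operations)
    unknown = received - {"cancel", "accept", "reject"}
    if unknown:
        raise ValueError(f"unknown operations: {sorted(unknown)}")
    if "cancel" in received or "reject" in received:
        return "audio"  # cancellation wins, even over a simultaneous accept
    if "accept" in received:
        return "video"
    return "audio"      # nothing decided yet: the call stays as it is
```

Because the decision is made in one place, both clients only need to render whatever media type the server reports.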

Incoming call during a call

In common audio and video call scenarios, if the called user is already on a voice or video call, the caller is usually told that the other party is busy and has to wait until they hang up.

Compare this with an ordinary phone call: we may receive a call from a third person while already talking. Carrier phone service offers selective answering and call waiting, which means we can handle two calls at the same time. For this scenario, Rongyun's Call Server stores the status of every call, so a user who is already on a call can still handle other calls.
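One way to picture per-call status on the server is a table keyed by call and user. The sketch below is a hypothetical model, not the Call Server's actual data structure: the class, the method names, and the status strings are all assumptions.

```python
from typing import Dict, Optional


class CallRegistry:
    """Server-side table of per-call, per-user status (hypothetical model)."""

    def __init__(self) -> None:
        self._calls: Dict[str, Dict[str, str]] = {}  # call_id -> {user: status}

    def active_call(self, user: str) -> Optional[str]:
        for call_id, members in self._calls.items():
            if members.get(user) == "active":
                return call_id
        return None

    def invite(self, call_id: str, caller: str, callee: str) -> str:
        """Register a new call and say how the callee should be notified."""
        notice = "call_waiting" if self.active_call(callee) else "incoming"
        self._calls[call_id] = {caller: "active", callee: "ringing"}
        return notice

    def answer(self, call_id: str, user: str) -> None:
        """Answer a ringing call, putting the user's current call on hold."""
        current = self.active_call(user)
        if current is not None:
            self._calls[current][user] = "held"
        self._calls[call_id][user] = "active"

    def status(self, call_id: str, user: str) -> str:
        return self._calls[call_id][user]


registry = CallRegistry()
first_notice = registry.invite("c1", "A", "B")   # B is idle
registry.answer("c1", "B")
second_notice = registry.invite("c2", "C", "B")  # B is already talking to A
registry.answer("c2", "B")                       # B's call with A goes on hold
```

Instead of rejecting the second call as busy, the server can tell B's client that a call is waiting and track both calls independently.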

Group chat call flow

A group call differs from a one-to-one call in two ways. First, there can be more than two participants. Second, the call does not end as long as at least one person remains in it.

Therefore, as soon as a group call is initiated, the initiator enters the RTC room right away and waits for the others to join. Another difference is that others can be invited to join the room at any time during the call, through a flow very similar to initiating the call.
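These membership rules can be sketched as a small class. It is a minimal model under stated assumptions: the `GroupCall` name and its methods are hypothetical, and the RTC room itself is not modeled.

```python
from typing import Iterable, Set


class GroupCall:
    """Membership rules for one group call (hypothetical model)."""

    def __init__(self, initiator: str, invitees: Iterable[str]) -> None:
        self.joined: Set[str] = {initiator}  # the initiator enters the room first
        self.invited: Set[str] = set(invitees) | {initiator}
        self.ended = False

    def invite(self, inviter: str, user: str) -> None:
        """Any current participant may invite someone mid-call."""
        if self.ended or inviter not in self.joined:
            raise ValueError("only participants of a live call can invite")
        self.invited.add(user)

    def join(self, user: str) -> None:
        if self.ended:
            raise ValueError("the call has ended")
        if user not in self.invited:
            raise PermissionError(f"{user} was not invited")
        self.joined.add(user)

    def leave(self, user: str) -> None:
        self.joined.discard(user)
        if not self.joined:
            self.ended = True  # the last participant has left


call = GroupCall("alice", ["bob"])
call.join("bob")
call.invite("bob", "carol")   # invited mid-call
call.join("carol")
call.leave("alice")           # the initiator leaving does not end the call
still_live = not call.ended
call.leave("bob")
call.leave("carol")
```

The call ends only when the last participant leaves, regardless of who started it.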

In commonly used communication software, a group call can only take place within a particular group, whereas the Rongyun service supports initiating calls and inviting people from different groups or from the private address book, anytime and anywhere.

1V1 upgrade to group chat

This scenario arises when two people are talking and realize that others need to join the conversation. Rongyun supports upgrading a 1V1 call directly to a group call.

The upgrade flow is similar to inviting others during a group call. The difference is that, on upgrade, the Call Server explicitly notifies the 1V1 participants that the current call has become a group call. After a successful upgrade, the flow is the same as for a group call.
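The notification step can be sketched as a function that decides what each user is told when the upgrade happens. The function name and the notification strings are hypothetical and only illustrate the distinction between existing participants and new invitees.

```python
from typing import Dict, List


def upgrade_to_group(participants: List[str], new_invitees: List[str]) -> Dict[str, str]:
    """Return the notification the server sends to each user on upgrade."""
    if len(participants) != 2:
        raise ValueError("only a 1V1 call can be upgraded")
    # Existing participants learn that the call type has changed.
    notifications = {user: "call_type_changed:group" for user in participants}
    # Everyone else receives an ordinary group call invitation.
    for user in new_invitees:
        if user not in notifications:
            notifications[user] = "group_invite"
    return notifications


notices = upgrade_to_group(["alice", "bob"], ["carol", "dave"])
```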

Join at any time before the call ends

This is also a common situation: a group call comes in at an inconvenient moment, or you have to step away after joining for a while. A group call does not end as long as one person is present, so users who are not currently in the call can join it again at any time before it ends.

This is a particularly practical optimization as work and communication move increasingly online: after stepping away, you can rejoin and continue the earlier discussion. It is possible because the Call Server stores call state, so a client can query at any time for the calls it has not yet joined.
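Such a query could be sketched as a filter over the call state held by the server. The `CallStatus` type and the `joinable_calls` function are assumptions for illustration and do not correspond to a documented CallLib interface.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CallStatus:
    call_id: str
    invited: Set[str] = field(default_factory=set)
    joined: Set[str] = field(default_factory=set)
    ended: bool = False


def joinable_calls(user: str, calls: List[CallStatus]) -> List[str]:
    """List live calls the user was invited to but is not currently in."""
    return [
        call.call_id
        for call in calls
        if not call.ended and user in call.invited and user not in call.joined
    ]


server_state = [
    CallStatus("standup", invited={"alice", "bob", "carol"}, joined={"alice", "bob"}),
    CallStatus("review", invited={"alice", "carol"}, joined={"alice", "carol"}),
    CallStatus("retro", invited={"bob", "carol"}, joined=set(), ended=True),
]
```

Here `joinable_calls("carol", server_state)` returns only the standup call: carol is already in the review call, and the retro has ended.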

Call record implementation

Previously, call records were stored locally, which made multi-device synchronization difficult.
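By contrast, when records are held on the server, every device of a user reads and changes the same data. The sketch below is a hypothetical model of such a store. The class, its methods, and the record fields are assumptions for illustration.

```python
from typing import Dict, List


class CallRecordStore:
    """Server-held call records for one user (hypothetical model)."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def add(self, record_id: str, peer: str, outcome: str) -> None:
        self._records[record_id] = {"peer": peer, "outcome": outcome, "read": False}

    def mark_read(self, record_id: str) -> None:
        self._records[record_id]["read"] = True

    def delete(self, record_id: str) -> None:
        self._records.pop(record_id, None)

    def unread_count(self) -> int:
        return sum(1 for record in self._records.values() if not record["read"])

    def snapshot(self) -> List[str]:
        """Record ids that any device of this user sees when it syncs."""
        return sorted(self._records)


store = CallRecordStore()
store.add("r1", "bob", "missed")
store.add("r2", "carol", "answered")
unread_before = store.unread_count()
store.mark_read("r1")   # done on the phone
store.delete("r2")      # done on the desktop
```

Whichever device syncs next sees one remaining record and an unread count of zero, because both changes were applied to the single copy on the server.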

Now call records can be updated and deleted, and even the unread count of call records can be synchronized across devices, giving users a more sensible and unified experience.

Experience optimization is a long-term topic, and audio and video calling is a feature that is both very common and extremely complex. Building on years of technical accumulation in the communications industry, Rongyun will continue to optimize the experience down to the smallest detail, eliminate the "stuck" moments in use, and provide developers with solutions that are more convenient and better matched to user habits.
