
This article is based on a talk given by Huo Kai (audio and video client architect at Qiniu Cloud) at "ECUG Meetup Session 1 | 2021 Audio and Video Technology Best Practices · Hangzhou" on June 26, 2021.

To get the full slide deck, add the ECUG assistant on WeChat (WeChat ID: ECUGCON) with the note "ECUG PPT". Talks by the other speakers will also be published in the future, so stay tuned.


The following is the transcript of the talk:

Hello everyone, let me introduce myself first. My name is Huo Kai and I am the audio and video client architect at Qiniu Cloud. I joined Qiniu Cloud in 2017 and have led and participated in the development of several audio and video SDKs, including short video, push streaming, and player SDKs. My current work focuses on the design and development of the real-time audio and video SDK, mostly on the client side.


This is what I will tell you today:

1) Technologies and challenges of a real-time audio and video SDK
2) The architecture of the Qiniu Cloud real-time audio and video SDK
3) Practical experience with a real-time audio and video SDK

Simply put, I will share the problems we often encounter while developing a real-time audio and video SDK and how we optimize for them; you can think of it as a summary of our practical experience.


1. Technologies and challenges of a real-time audio and video SDK

Today's session is about audio and video. I believe everyone here has some knowledge of or experience with audio and video technology, but I don't know how many of you have ever built or designed an SDK. It comes with its own unique challenges.

1. Customer demands for a real-time audio and video SDK


As a company that provides technical services, Qiniu Cloud hears a lot from its customers. For example, customers often ask what bit rate the video should be configured with, why calling a certain interface has no effect, how to integrate a third-party beauty SDK, and so on.

In addition, some customers run into abnormal behavior during use, such as garbled video, black screens, or failures to join a room.

Also, customers often raise feature requests. For example, some want to publish the video from both the camera and the phone screen at the same time, or want access to the raw video or audio frame data of each stream, and so on.

When these problems come up, we follow the steps of discovering, analyzing, and solving them one by one. But first, let's step back and analyze the demands behind these problems. In short, we have summarized three points:


1) Integrators vary widely in their audio and video expertise, but without exception they all want to integrate quickly.

We cannot require customers to reach a certain level of audio and video expertise before they can use the SDK, so we can only demand of ourselves that the SDK be extremely simple and easy to use, so that users can integrate it as quickly as possible.

2) Integrators' usage scenarios and environments are complex and varied, yet their requirements for the audio and video experience are very high.

For example, video conferencing is highly latency-sensitive; a delay of more than 300 milliseconds makes the other party's speech noticeably lag. In live streaming, users demand very high clarity: one broadcaster may face a huge audience, so an unclear picture affects every viewer. And even within live streaming, outdoor and indoor broadcasts are two completely different environments, yet we must guarantee the best audio and video experience the current environment allows.

3) Integrators need a "sense of security": they have requirements for service stability, monitoring capabilities, and troubleshooting capabilities.

Integrators need real visibility into how online users are doing, including the real-time quality of audio and video and any errors or exceptions that occur. And once a problem does occur, whether it is caused by the SDK or by incorrect usage, they must be able to troubleshoot quickly to minimize the impact.

2. Core requirements of a real-time audio and video SDK

Based on these customer demands, we can define what an excellent real-time audio and video SDK should look like:


1) The interface is simple and the boundary is clear.

Integrators should never feel any ambiguity when calling our API; the interface must be simple and clear so that it is easy to use.

2) Strong extensibility and a healthy ecosystem.

Functions should be easy to extend on top of the SDK, such as advanced beauty effects, face moderation, speech-to-text, and so on. Our approach is to build plug-ins for different scenarios on top of the SDK, making it as convenient as possible for users to extend it.

3) Abstraction and optimization per scenario, with scenario-specific features.

As mentioned earlier, video conferencing and live streaming have different audio and video quality requirements, so we should optimize for each scenario separately and provide core features that cover that scenario as completely as possible.

4) First-frame time, stuttering, latency, echo, and the other aspects of the audio and video experience are optimized to the extreme.

These are the core metrics of the audio and video experience, so no amount of effort spent on QoS optimization is ever too much.

5) A stable and reliable service, rich data instrumentation, and visualized data monitoring and analysis.

Reliability is a prerequisite for technical services, and data is a prerequisite for ensuring reliability.

3. Technical difficulties of a real-time audio and video SDK

Having said all this, you can probably sense that building a real-time audio and video SDK is genuinely hard. Specifically, what are its technical difficulties? Let me give a few examples:


They include the capture, encoding, and transmission of audio and video; video and audio processing, such as beauty filters and audio mixing; and also weak-network optimization, data management and reporting, crash analysis, the audio 3A algorithms (AEC, ANS, AGC), and so on.

In addition, there are compatibility adaptation, performance optimization, and other concerns. These are the technical difficulties we have to face and solve. In the third part, on practice, I will briefly walk through some of our experience with them.


2. Architecture of the Qiniu Cloud real-time audio and video SDK

Next, let me introduce how the Qiniu Cloud real-time audio and video SDK is designed. First, let's take a look at its iteration history.


In 2018 we launched version 1.0, which supported the core audio and video communication functions. After that, we found more and more customers publishing more than one stream in a room, such as screen content plus camera content, so in version 2.0 we added multi-track support, a solution that lets users publish multiple streams. Moving on to 3.0, we wanted everyone to get the best possible audio and video experience in their current network environment, so we built the large/small stream (simulcast) strategy.

Beyond the normal iteration of the SDK, we have also released supporting solutions, such as a video conferencing solution, an interactive live streaming solution, and an online interview solution. We also provide plug-ins, such as the beauty plug-in, which makes it convenient for customers to integrate a beauty SDK, and the soon-to-launch whiteboard plug-in, which serves users in education scenarios.

1. Module division of the Qiniu Cloud real-time audio and video SDK

The Qiniu Cloud real-time audio and video SDK is divided into the following modules:


At the bottom layer are the external libraries we depend on. For example, we implement the SDK's media communication through WebRTC and signaling through WebSocket, alongside self-developed libraries such as HappyDNS, QNBeautyLib, QNEncoderLib, and QNAECLib.

One layer up, we wrap these foundations into business modules, including:

  • CameraManager: responsible for camera capture, rotation, exposure, focus, and related functions.
  • MicrophoneManager: responsible for microphone capture.
  • RenderProcesser: responsible for video processing, such as watermarking, cropping, beautification, and filters.
  • AudioProcesser: responsible for audio processing, such as resampling and mixing.
  • RoomManager: responsible for core functions such as joining rooms, publishing, and subscribing.
  • CrashReporter: responsible for reporting stack traces promptly when a crash occurs.

On the next layer up, we provide the core API, as well as plug-ins such as advanced beauty and whiteboard.

The top layer is the user's own business layer.
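As a rough, hypothetical sketch of this layering (only the manager names above come from the architecture; the methods and types are invented for illustration), the core API layer simply delegates to the business modules, which in turn drive the external libraries:

```java
// Hypothetical layering sketch: core API -> business modules -> external libs.
// Only the manager names appear in the architecture above; the methods are invented.
public class RtcEngine {
    // Business-module layer (stubs standing in for the real modules).
    static class RoomManager {
        void join(String roomToken) { /* WebSocket signaling + WebRTC setup */ }
        void publish(Object track)  { /* negotiate and send the track */ }
    }
    static class CameraManager {
        Object startCapture()       { return new Object(); /* camera frames */ }
    }

    private final RoomManager roomManager = new RoomManager();
    private final CameraManager cameraManager = new CameraManager();

    // Core API layer: the thin, simple surface the user's business layer calls.
    public void joinRoom(String roomToken) { roomManager.join(roomToken); }
    public void publishCamera()            { roomManager.publish(cameraManager.startCapture()); }
}
```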

2. Data flow of the Qiniu Cloud real-time audio and video SDK

Next, I will briefly introduce the data flow of the Qiniu Cloud real-time audio and video SDK.


1) Capture: video data is captured from the camera or screen, and audio data from the microphone. Captured video is YUV or texture data; captured audio is PCM data.
2) Processing: the captured data is sent to the audio and video processing modules. Video can be processed for beautification, watermarking, mirroring, etc.; audio can be resampled and mixed.
3) Encoding: the processed data is sent to the video and audio encoders; either software or hardware encoding can be chosen. The encoders output H.264 and Opus packets.
4) Packetization: the encoded data is sent to the audio and video packaging module to be packetized.
5) Upload: the packetized data is transmitted to our streaming media server over RTP.
6) Forwarding: the streaming media server forwards the data to the subscribers in the room.

The subscribing end is exactly the publishing end in reverse: depacketization, decoding, audio and video processing, and finally rendering!
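To make the publish side concrete, here is a minimal, self-contained Java sketch of these stages chained together. The stage names follow the list above; the types and buffer shapes are illustrative simplifications, not the SDK's real internals:

```java
// Illustrative publish pipeline: capture -> process -> encode -> packetize -> send.
// Types and buffers are simplified stand-ins, not the real SDK internals.
import java.util.List;

interface VideoProcessor { byte[] process(byte[] yuvFrame); }       // e.g. beauty, watermark, mirror
interface VideoEncoder   { byte[] encode(byte[] yuvFrame); }        // outputs an H.264 frame
interface Packetizer     { List<byte[]> packetize(byte[] frame); }  // splits a frame into RTP payloads
interface Transport      { void send(byte[] rtpPacket); }           // pushes packets to the server

public class PublishPipeline {
    private final VideoProcessor processor;
    private final VideoEncoder encoder;
    private final Packetizer packetizer;
    private final Transport transport;

    public PublishPipeline(VideoProcessor p, VideoEncoder e, Packetizer k, Transport t) {
        this.processor = p; this.encoder = e; this.packetizer = k; this.transport = t;
    }

    // Called once per captured frame; the subscribe side runs the same chain in reverse.
    public void onFrameCaptured(byte[] yuvFrame) {
        byte[] processed = processor.process(yuvFrame);
        byte[] encoded = encoder.encode(processed);
        for (byte[] packet : packetizer.packetize(encoded)) {
            transport.send(packet);
        }
    }
}
```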


3. Practical experience with a real-time audio and video SDK

Next, I will introduce some of our practical experience with the real-time audio and video SDK.

1. Extensible beauty plugin

First, let me introduce our extensible beauty plug-in. Many users find it difficult to integrate a beauty SDK with an audio and video SDK, because the video frames may need OpenGL preprocessing before the beauty SDK can use them, which is hard for users. So we built a plug-in that sits between the beauty SDK and the RTC SDK.


First, the RTC SDK captures frames from the camera and hands the captured data to the beauty plug-in layer. What does the beauty plug-in handle? Loading beauty-effect resources, converting OES textures to 2D textures, and correcting the rotation of textures captured by the camera. The plug-in produces texture data that meets the beauty SDK's input specification; the beauty SDK then applies beautification, makeup, filters, stickers, and so on, and returns the processed texture to the RTC SDK, which finally previews, encodes, and transmits it.

Those are the details of our internal implementation. The processing inside is quite complicated, but externally it is very simple: we expose only the simplest interfaces, such as setBeauty, setSticker, and setFilter, which lowers users' integration cost.
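Here is a sketch of what that plug-in boundary can look like. The method names setBeauty, setSticker, and setFilter come from the text above; the frame-callback shape is our illustrative assumption, while the shader is the standard Android GLSL for the "OES to 2D" conversion step:

```java
// Sketch of the plug-in boundary between the RTC SDK and a beauty SDK.
// setBeauty/setSticker/setFilter come from the text above; the callback shape
// is an illustrative assumption, not the actual plug-in API.
public interface BeautyPlugin {
    void setBeauty(float level);         // simplest possible knobs exposed to users
    void setSticker(String stickerPath);
    void setFilter(String filterName);

    // Called by the RTC SDK per captured frame. Input is the camera's OES texture;
    // output is a normal 2D texture that has been rotated, converted, and beautified.
    int onVideoFrame(int oesTextureId, int width, int height, float[] texMatrix);

    // Standard Android GLSL for rendering an external (OES) camera texture into a
    // regular GL_TEXTURE_2D via an FBO -- the "OES to 2D" step described above.
    String OES_FRAGMENT_SHADER =
            "#extension GL_OES_EGL_image_external : require\n" +
            "precision mediump float;\n" +
            "varying vec2 vTexCoord;\n" +
            "uniform samplerExternalOES uTexture;\n" +
            "void main() { gl_FragColor = texture2D(uTexture, vTexCoord); }\n";
}
```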

2. Large and small stream strategy

Next, I will introduce the strategy of large and small streams.


Why have a large and small stream strategy? We want to achieve the best audio and video experience in each user's own network environment. For example, user A publishes a video; the video is fed to the encoder and encoded as three channels at high, medium, and low resolution, all of which are sent to the streaming server.

When forwarding, the streaming media server chooses per subscriber based on the network bandwidth of users B, C, and D. For example, if user B has a good network environment, he directly subscribes to the highest-resolution channel; user D's network is poor, so under that weak network he subscribes to the lowest resolution. And if user B's network changes, say he subscribed to the high resolution at first and then his network deteriorates, he automatically falls back to a lower resolution, so he maintains fluency instead of experiencing stutters.
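As a rough sketch with assumed numbers (the resolution and bit-rate ladder here is invented, not Qiniu's actual configuration), the publish side might configure three encoder instances like this on Android:

```java
// Illustrative simulcast setup: one capture source, three H.264 encoder configs.
// The resolutions and bit rates are assumptions, not Qiniu's actual ladder.
import android.media.MediaFormat;

public class SimulcastConfig {
    static MediaFormat layer(int width, int height, int bitrateBps) {
        MediaFormat format = MediaFormat.createVideoFormat("video/avc", width, height);
        format.setInteger(MediaFormat.KEY_BIT_RATE, bitrateBps);
        format.setInteger(MediaFormat.KEY_FRAME_RATE, 24);
        format.setInteger(MediaFormat.KEY_I_FRAME_INTERVAL, 2);
        return format;
    }

    // High / medium / low channels all encode the same captured frames;
    // the server forwards whichever layer fits each subscriber's bandwidth.
    static final MediaFormat HIGH   = layer(1280, 720, 1_800_000);
    static final MediaFormat MEDIUM = layer(640, 360, 600_000);
    static final MediaFormat LOW    = layer(320, 180, 200_000);
}
```

The publisher pays the encoding cost once per layer, while each subscriber only ever receives the single layer that fits its current bandwidth.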

3. QoS optimization


QoS optimization occupies a very important place in the overall audio and video experience; its importance cannot be overemphasized. On top of WebRTC, we mainly optimize in the following areas:

1) Bandwidth estimation: we mainly use algorithms such as GCC and BBR.
2) Packet-loss resistance: here we mainly optimized the intelligent combination of data redundancy (FEC) and packet-loss retransmission (NACK); see the sketch after this list.
3) Jitter smoothing: including optimization of NetEQ, the jitter buffer, and so on.
4) Bandwidth allocation: including strategies such as audio priority, video layering, and balancing uplink against downlink; allocation strategies can also be tailored to the user's scenario and business.
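To illustrate the redundancy-versus-retransmission trade-off in item 2, here is a deliberately simplified heuristic (our own illustration, not WebRTC's or Qiniu's actual algorithm): when the round trip is short, retransmission is cheap; when it is long, proactive redundancy avoids waiting a full extra round trip.

```java
// Illustrative heuristic for mixing FEC redundancy with NACK retransmission.
// Thresholds are made up for illustration; real QoS logic is far more involved.
public class LossProtectionPlanner {
    /** Returns the fraction of extra FEC packets to send (0.0 - 0.5). */
    public static double fecOverhead(double lossRate, int rttMs) {
        if (lossRate < 0.02) {
            return 0.0;                 // negligible loss: no redundancy needed
        }
        if (rttMs < 100) {
            return lossRate * 0.5;      // short RTT: lean on NACK, add a little FEC
        }
        // Long RTT: retransmission adds a full round trip of delay,
        // so cover the measured loss with proactive redundancy instead.
        return Math.min(0.5, lossRate * 1.5);
    }

    public static void main(String[] args) {
        System.out.println(fecOverhead(0.05, 40));   // 0.025 -> mostly retransmit
        System.out.println(fecOverhead(0.05, 250));  // 0.075 -> mostly redundancy
    }
}
```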

4. Echo cancellation optimization

Next, let me walk you through our echo cancellation optimization using two scenarios.

[Figure: spectrograms for the two scenarios described below]

The two spectrograms, left and right, represent the two scenarios; each contains three rows of data:

Row 1: the spectrogram of the original signal;
Row 2: the signal after filtering by WebRTC's built-in echo cancellation;
Row 3: the signal after filtering by Qiniu Cloud's self-developed echo cancellation.

The scenario on the left is a person talking with particularly noisy background music playing behind them, where we want to eliminate that background music. As you can see in the middle row, an obvious continuous horizontal line remains below: the residue of the music. In the third row, you can clearly see the music is eliminated without residue; the cancellation is quite thorough.

The picture on the right is a voice call scenario, specifically double-talk: A is talking, and suddenly the other party B also starts talking. In this case we do not want to cancel user A's voice, but WebRTC's built-in echo cancellation filters out some of A's sound. From the third row of the right figure, you can see that our self-developed algorithm handles this "word-swallowing" phenomenon much better.

5. Compatibility optimization

Compatibility is one of the most troublesome things we have run into. For example, the garbled video or echo on a particular device that users mentioned at the beginning can usually be attributed to compatibility. Broadly, we split compatibility optimization into two strategies:


The first strategy is dynamic switching: the encoder can be switched dynamically at runtime.

For example, with the encoding configuration we set, turning on the hardware encoder fails with an exception on a certain phone. In that case we must first catch the exception and then switch to the software encoder without the user noticing, so that the feature keeps working.

Another example: when a user is on the software encoder, encoding efficiency on a certain phone is extremely low and the FPS falls far below our expected value. Then we also spin up the hardware encoder for comparison, see which encoder performs better, and dynamically switch to the more suitable one, again without the user noticing. A sketch of the fallback case follows.
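Here is a minimal Android sketch of the first case, hardware-encoder failure falling back to software. The MediaCodec calls are the real Android API; the SoftwareEncoder type is a hypothetical stand-in for an x264/OpenH264-style wrapper:

```java
// Dynamic-switching sketch: try the hardware H.264 encoder, fall back to software on failure.
// SoftwareEncoder is a hypothetical wrapper; the MediaCodec calls are the real Android API.
import android.media.MediaCodec;
import android.media.MediaFormat;
import java.io.IOException;

public class EncoderFactory {
    interface SoftwareEncoder { /* hypothetical x264/OpenH264-style wrapper */ }

    /** Returns either a started MediaCodec or the software fallback. */
    public static Object createH264Encoder(MediaFormat format, SoftwareEncoder fallback) {
        try {
            MediaCodec codec = MediaCodec.createEncoderByType("video/avc");
            codec.configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE);
            codec.start();
            return codec;                    // hardware path succeeded
        } catch (IOException | IllegalStateException | IllegalArgumentException e) {
            // Creation or configuration failed on this device: switch to software
            // without surfacing anything to the user.
            return fallback;
        }
    }
}
```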


The second strategy is the whitelist strategy: when the RTC SDK initializes, the server automatically delivers a configuration whitelist based on the detected device.

For example, on a certain Android phone, hardware encoding produces garbled frames whenever the resolution is not a multiple of 16. We collected such models into our configuration whitelist; when one of those models is detected, we either align the resolution to a multiple of 16, or switch from the phone's hardware encoder to the software encoder, which solves this garbled-frame compatibility problem.

For another example, we found that echo cancellation performs very poorly on some phones at a 48 kHz sampling rate, so when such a device is detected, we can lower the 48 kHz sampling rate to 16 kHz, or switch to our self-developed echo cancellation, which solves the echo compatibility problem. Both kinds of overrides are sketched below.
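A minimal sketch of what such whitelist overrides might look like (the model names and rules are placeholders, not Qiniu's real whitelist, which is delivered by the server):

```java
// Whitelist sketch: per-model overrides applied at init time.
// Model names and rules are placeholders, not Qiniu's real whitelist.
import java.util.Map;
import java.util.Set;

public class DeviceWhitelist {
    // Models whose hardware encoder garbles frames at non-multiple-of-16 resolutions.
    static final Set<String> ALIGN_TO_16 = Set.of("VENDOR-MODEL-A", "VENDOR-MODEL-B");
    // Models whose echo cancellation misbehaves at 48 kHz.
    static final Map<String, Integer> SAMPLE_RATE_OVERRIDE = Map.of("VENDOR-MODEL-C", 16000);

    static int alignResolution(String model, int size) {
        // Round down to the nearest multiple of 16 for affected models.
        return ALIGN_TO_16.contains(model) ? size & ~15 : size;
    }

    static int sampleRate(String model, int requested) {
        return SAMPLE_RATE_OVERRIDE.getOrDefault(model, requested);
    }

    public static void main(String[] args) {
        System.out.println(alignResolution("VENDOR-MODEL-A", 1080)); // -> 1072
        System.out.println(sampleRate("VENDOR-MODEL-C", 48000));     // -> 16000
    }
}
```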

6. Data collection and reporting

We set three requirements for the data collection and reporting module:

1) Real-time reporting. The RTC SDK cannot wait until the whole session is over to report; it needs to report data in real time.

2) Action restoration. From the collected SDK logs, we must be able to faithfully reconstruct how the user invoked the SDK. For example, earlier a customer asked why calling a certain interface had no effect, or why joining a room failed. We often reconstruct the order of their interface calls, or spot wrong parameters, from the reported logs, and thereby troubleshoot for customers quickly.

3) Module isolation. The data collection and reporting module must never affect the normal audio and video communication, that is, the main business modules. Even if exceptions occur inside it, they must be caught; collection and reporting must never cause a problem that makes the service unavailable to the user. The sketch below illustrates this isolation.
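Here is a minimal Java sketch of that isolation principle (illustrative, not the SDK's actual reporter): the reporter gets its own daemon thread and swallows its own failures, so nothing can leak back into the media path.

```java
// Module-isolation sketch: reporting runs on its own thread and swallows its own
// failures so it can never break the audio/video path. Illustrative only.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IsolatedReporter {
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "qos-reporter");
        t.setDaemon(true);            // never keeps the process alive
        return t;
    });

    /** Called from the media path; must return immediately and must never throw. */
    public void report(String event) {
        try {
            worker.execute(() -> {
                try {
                    upload(event);    // network I/O stays off the media threads
                } catch (Exception e) {
                    // Swallow: a reporting failure must not surface anywhere else.
                }
            });
        } catch (Exception e) {
            // Even executor rejection must not propagate into the caller.
        }
    }

    private void upload(String event) { /* send to the reporting endpoint */ }
}
```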

What do we collect? First, let me state clearly that Qiniu Cloud does not collect users' private data; that is absolutely off-limits. Broadly, the data falls into two parts: basic SDK information and audio and video quality information.

  • Basic SDK information: including SDK call logs, error and crash information, device model information, SDK internal state, and so on.
  • Audio and video quality information: including first-frame time, real-time frame rate, real-time bit rate, real-time packet loss rate, and so on.

7. Data monitoring and analysis

Finally, we need to visually monitor and analyze the collected data.

[Screenshot: reconstructed timeline of a user's SDK calls]

The example in this picture is the action restoration I just mentioned. From it we can clearly see which interfaces the user called, in what order, and what the internal state was. From initialization to joining the room, publishing, subscribing, and finally leaving the room, everything is visible in the backend; the moment a user holds the SDK wrong, we can perceive it.

[Screenshot: real-time bit rate curve of a single stream]

This picture shows the real-time bit rate data of one stream.

Because the data is reported in real time, we can watch the current bit rate, frame rate, packet loss rate, RTT, and the other core metrics that shape the audio and video experience. When these curves change significantly, we can set thresholds on them to trigger an alarm mechanism, as sketched below.
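As a toy illustration of such threshold-based alerting (the threshold and window size are made up), an alarm might fire only when a metric stays past its threshold for several consecutive samples, so that a single noisy sample does not page anyone:

```java
// Threshold-alarm sketch: fire once when a metric stays past its threshold for
// several consecutive samples. Thresholds and window size are illustrative.
public class MetricAlarm {
    private final double threshold;
    private final int window;       // consecutive bad samples required
    private int badCount = 0;

    public MetricAlarm(double threshold, int window) {
        this.threshold = threshold;
        this.window = window;
    }

    /** Feed one real-time sample, e.g. packet loss rate; returns true when the alarm fires. */
    public boolean onSample(double value) {
        badCount = (value > threshold) ? badCount + 1 : 0;
        return badCount == window;  // fire once per excursion
    }

    public static void main(String[] args) {
        MetricAlarm lossAlarm = new MetricAlarm(0.10, 3);  // >10% loss for 3 samples
        double[] loss = {0.02, 0.12, 0.15, 0.20, 0.05};
        for (double v : loss) {
            if (lossAlarm.onSample(v)) System.out.println("packet-loss alarm!");
        }
    }
}
```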


Q & A

Question: Today's topic is audio and video, but in our daily work we are more often exposed to camera-based facial recognition. Can this SDK's technology be used for that?

Huo Kai: On top of the real-time audio and video SDK, we will provide a set of plug-ins covering face recognition, liveness detection, speech-to-text, and other functions. Behind these plug-ins are Qiniu Cloud's intelligent multimedia services, which make it convenient for users to implement face recognition and related features during audio and video calls.

Question: Different phone models have different hardware encoding capabilities, and each manufacturer's underlying implementation differs. You mentioned some compatibility issues, but in practice hardware encoding depends on the manufacturer's encoder capability: the encoder should hold the target bit rate while encoding, yet some hardware encoders cannot. So does Qiniu Cloud recommend hardware encoding first, or software encoding first?

Huo Kai: We recommend hardware encoding first, because it is more efficient. But there is a premise: when a hardware encoder has a problem, the SDK must sense it immediately and then automatically switch to software encoding through the dynamic switching strategy.


About Qiniu Cloud, ECUG and ECUG Meetup

Qiniu Cloud: Founded in 2011, Qiniu Cloud is a well-known domestic cloud computing and data service provider. It continues to invest deeply in core technologies for massive file storage, CDN content distribution, video on demand, interactive live streaming, and large-scale heterogeneous data analysis and processing, committed to driving the digital future with data technology and empowering all industries to fully enter the data age.

ECUG: short for Effective Cloud User Group, founded in 2007 as CN Erlounge II and initiated by Xu Shiwei, it is an indispensable high-end frontier group in the technology field. As a window on the industry's technological progress, ECUG brings together many technologists, follows current hot technologies and cutting-edge practices, and jointly leads the industry's technological transformation.

ECUG Meetup: a series of technology-sharing events jointly created by ECUG and Qiniu Cloud, positioned as offline gatherings for developers and technology practitioners. The goal is to create a high-quality learning and networking platform for developers, where participants co-create and share knowledge, generating new insight that advances understanding and technology, promoting progress across the industry through exchange, and building a better platform and room to grow for developers and technology practitioners.

